Abstract 

Data in the form of pairwise comparisons arises in many domains, including preference elicitation, 
sporting competitions, and peer grading among others. We consider parametric ordinal models for 
such pairwise comparison data involving a latent vector $w^* \in \mathbb{R}^d$ that represents the "qualities" of the 
$d$ items being compared; this class of models includes the two most widely used parametric models, 
the Bradley-Terry-Luce (BTL) and the Thurstone models. Working within a standard minimax 
framework, we provide tight upper and lower bounds on the optimal error in estimating the quality 
score vector w* under this class of models. The bounds depend on the topology of the comparison 
graph induced by the subset of pairs being compared via its Laplacian spectrum. Thus, in settings 
where the subset of pairs may be chosen, our results provide principled guidelines for making this 
choice. Finally, we compare these error rates to those under cardinal measurement models and show 
that the error rates in the ordinal and cardinal settings have identical scalings apart from constant 
pre-factors. 

1. Introduction 

In an increasing range of applications, it is of interest to elicit judgments from non-expert humans. 
For instance, in marketing, elicitation of preferences of consumers about products, either directly 
or indirectly, is a common practice (Green et al., 1981). The gathering of this and related data 
types has been greatly facilitated by the emergence of “crowdsourcing” platforms such as Amazon 
Mechanical Turk: they have become powerful, low-cost tools for collecting human judgments (Khatib 
et al., 2011; Lang and Rio-Ross, 2011; von Ahn et al., 2008). Crowdsourcing is employed not only for 
collection of consumer preferences, but also for other types of data, including counting the number 
of malaria parasites in an image of a blood smear (Luengo-Oroz et al., 2012); rating responses of 
an online search engine to search queries (Kazai, 2011); or for labeling data for training machine 
learning algorithms (Hinton et al., 2012; Raykar et al., 2010; Deng et al., 2009). In a different domain, 
competitive sports can be understood as a mechanism for sequentially performing comparisons 
between individuals or teams (Ross, 2007; Herbrich et al., 2007). Finally, peer-grading in massive 
open online courses (MOOCs) (Piech et al., 2013) can be viewed as another form of elicitation. 

A common method of elicitation is through pairwise comparisons. For instance, the decision 
of a consumer to choose one product over another constitutes a pairwise comparison between the 
two products. Workers in a crowdsourcing setup are often asked to compare pairs of items: for 
instance, they might be asked to identify the better of two possible results of a search engine, as 
shown in Figure 1a. Competitive sports such as chess or basketball also involve sequences of pairwise 
comparisons. From a modeling point of view, we can think of pairwise comparisons as a means of 
estimating the underlying "qualities" or "weights" of the items being compared (e.g., skill levels of 
chess players, relevance of search engine results, etc.). Each pairwise comparison can be viewed as 
a noisy sample of some function of the underlying pair of (real-valued) weights. Noise can arise 
from a variety of sources. When objective questions are posed to human subjects, noise can arise 
from their differing levels of expertise. In a sports competition, many sources of randomness can 
influence the outcome of any particular match between a pair of competitors. Thus, one important 
goal is to estimate the latent qualities based on noisy data in the form of pairwise comparisons. A 
related problem is that of experimental design: assuming that we can choose the subset of pairs to 
be compared (e.g., in designing a chess tournament), what choice will allow for the most accurate 
estimation? Characterizing the fundamental difficulty of estimating the weights will allow us to make 
this choice judiciously. These tasks are the primary focus of this paper. 

Figure 1. An example of eliciting judgments from people: rating the relevance of the result of a search 
query. (a) Asking for a pairwise comparison: "Which image is more relevant for the search query 
'INTERNET'?" (b) Asking for a numeric score: "How relevant is this image for the search query 
'INTERNET'?", answered with a score out of 100. 

In more detail, the focus of this paper is the aggregation from pairwise comparisons in a fairly 
broad class of parametric models. This class includes as special cases the two most popular models for 
pairwise comparisons—namely, the Thurstone (Case V) (Thurstone, 1927) and the Bradley-Terry- 
Luce (BTL) (Bradley and Terry, 1952; Luce, 1959) models. The Thurstone (Case V) model has been 
used in a variety of both applied (Swets, 1973; Ross, 2007; Herbrich et al., 2007) and theoretical 
papers (Bramley, 2005; Krabbe, 2008; Nosofsky, 1985). Similarly, the BTL model has been popular 
in both theory and practice (e.g., Nosofsky, 1985; Atkinson et al., 1998; Koehler and Ridpath, 1982; 
Heldsinger and Humphry, 2010; Loewen et al., 2012; Green et al., 1981; Khairullah and Zionts, 
1987). 

1.1 Some past work 

There is a vast literature on the Thurstone and BTL models, and we focus on the papers most closely 
related to our own work. Negahban et al. (2012) provide minimax bounds for the BTL model in 
the special case of comparisons chosen uniformly at random. They focus on this case in order to 
complement their analysis of an algorithm based on a random walk. In their analysis, there is a gap 
between the achievable rate of the MLE and the lower bound. In contrast, our analysis eliminates 
this discrepancy and shows that the MLE is an optimal estimator (up to constant factors) and achieves 
the minimax rate. In independent and concurrent work, Hajek et al. (2014) consider the problem 
of estimation in the Plackett-Luce model, which extends the BTL model to comparisons of two or 
more items. They derive bounds on the minimax error rates under this model which are tight up 
to logarithmic factors. In contrast, our results are tight up to constants and, as we emphasize in 
the following section, provide deeper insights into the role of the topology of the comparison graph. 
Jagabathula and Shah (2008) design an algorithm for aggregating ordinal data when the underlying 
distribution over the permutations is assumed to be sparse. Ammar and Shah (2011) employ a 
different, maximum entropy approach towards parameterization and inference from partially ranked 
data. Rajkumar and Agarwal (2014) study the statistical convergence properties of several rank 
aggregation algorithms. 

Our work assumes a fixed design setup. In this setup, the choice of which pairs to compare and 
the number of times to compare them is chosen ahead of time in a non-adaptive fashion. There 
is a parallel line of literature on “sorting" or “active ranking" from pairwise comparisons. For 
instance, Braverman and Mossel (2008) assume a noise model where the outcome of a pairwise 
comparison depends only on the relative ranks of the items being compared, and not on their actual 
ranks or values. On the other hand, Jamieson and Nowak (2011) consider the problem of ranking 
a set of items assuming that items can be embedded into a smaller-dimensional Euclidean space, 
and that the outcomes of the pairwise comparisons are based on the relative distances of these items 
from a fixed reference point in the Euclidean space. 

A recent line of work considers a variant of the BTL and the Thurstone models where the 
comparisons may depend on some auxiliary unknown variable in addition to the items being compared; 
for instance, the accuracy of the individual making the comparison in an objective task. Chen et al. 
(2013) consider a crowdsourcing setup where the outcome depends on the worker's expertise. They 
present algorithms for inference under such a model and present empirical evaluations. Yi et al. 
(2013) consider a problem in the spirit of collaborative filtering where certain unknown preferences 
of a certain user must be predicted based on the preferences of other users as well as of that user over 
other items. Lee et al. (2011) consider the inverse problem of measuring the expertise of 
individuals based on the rankings submitted by them, and the proposed algorithms assume an underlying 
Thurstone model. 

1.2 Our contributions 

Both the Thurstone (Case V) and BTL models involve an unknown vector $w^* \in \mathbb{R}^d$ corresponding 
to the underlying qualities of $d$ items, and in a pairwise comparison between items $j$ and $k$, the 
probability of $j$ being ranked above $k$ is some function $F$ of the difference $w^*_j - w^*_k$. The Thurstone 
(Case V) and BTL models are based on different choices of $F$, and both belong to the broader class of models 
analyzed in this paper, in which $F$ is required only to be strongly log-concave. 

With this context, the main contributions of this paper are to provide some answers to the 
following questions: 

• How does the minimax error for estimating the weight vector $w^*$ in various norms scale with the 
problem dimension (the number of items) and the number of observations? 

— We derive upper and lower bounds on the minimax estimation rates under the model described 
above. Our upper/lower bounds on the estimation error agree up to constant factors: to the 
best of our knowledge, despite the voluminous literature on these two models, this provides 
the first sharp characterization of the associated minimax rates. Moreover, our error guarantees 
provide guidance to the practitioner in assessing the number of pairwise comparisons to be made 
in order to guarantee a pre-specified accuracy. 

• Given a budget of n comparisons, which pairs of items should be compared? 

— The bounds that we derive depend on the comparison graph induced by the subset of pairs that 
are compared. Our theoretical analysis reveals that the spectral gap of a certain scaled version 
of the graph Laplacian plays a fundamental role, and provides guidelines for the practitioner on 
how to choose the subset of comparisons to be made. 

• When is it better to elicit pairwise comparisons versus numeric scores? 

— When eliciting data, one often has the liberty to ask for either cardinal values (Figure 1b) or 
for pairwise comparisons (Figure 1a) from the human subjects. One would like to adopt the 
approach that would lead to a better estimate. One may be tempted to think that cardinal 
elicitation methods are superior, since each cardinal measurement gives a real-valued number 
whereas an ordinal measurement provides at most one bit of information. Our bounds show, 
however, that the scaling of the error in the cardinal and ordinal settings is identical up to 
constant pre-factors. As we demonstrate, this result allows for a comparison of cardinal and 
ordinal data elicitation methods in terms of the per-measurement noise alone, independent of 
the number of measurements and the number of items. A priori, there is no obvious reason for 
the relative performance to be independent of the number of measurements and items. 

Notation: For any symmetric matrix $M$ of size $(m \times m)$, we will let $\lambda_1(M) \le \lambda_2(M) \le \cdots \le 
\lambda_m(M)$ denote its ordered eigenvalues. We will use the notation $D_{\mathrm{KL}}(P_1 \| P_2)$ to denote the Kullback- 
Leibler divergence between the two distributions $P_1$ and $P_2$. For any integer $m$, we will let $[m]$ denote 
the set $\{1, \ldots, m\}$. 

2. Problem formulation 

We begin with some background followed by a precise formulation of the problem. 

2.1 Generative models for ranking 

Given a collection of $d$ items to be evaluated, we suppose that each item has a certain numeric quality 
score, and a comparison of any pair of items is generated via a comparison of the two quality scores 
in the presence of noise. We represent the quality scores as a vector $w^* \in \mathbb{R}^d$, so item $j \in [d]$ has 
quality score $w^*_j$. Now suppose that we make $n$ pairwise comparisons: if comparison $i \in [n]$ pertains 
to comparing item $a_i$ with item $b_i$, then it can be described by a differencing vector $x_i \in \mathbb{R}^d$, with 
entry $a_i$ equal to $1$, entry $b_i$ equal to $-1$, and the remaining entries set to $0$. 

With this notation, we study the problem of estimating the weight vector $w^*$ based on observing 
a collection of $n$ independent samples $y_i \in \{-1, 1\}$ drawn from the distribution 

$$\mathbb{P}[y_i = 1 \mid x_i, w^*] = F\left(\frac{\langle x_i, w^* \rangle}{\sigma}\right) \quad \text{for } i \in [n], \tag{Ordinal}$$

where $F$ is a known function taking values in $[0, 1]$. Since the probability of item $a_i$ dominating $b_i$ 
should be independent of the order of the two items being compared, we require throughout that 

$$F(x) = 1 - F(-x).$$

In any model of the general form (Ordinal), the parameter $\sigma > 0$, assumed to be known, plays 
the role of a noise parameter, with a higher value of $\sigma$ leading to more uncertainty in the comparisons. 
Moreover, we assume that $F$ is strongly log-concave in a neighborhood of the origin, meaning that 
there is some curvature parameter $\gamma > 0$ such that 

$$\frac{d^2}{dt^2}\big(-\log F(t)\big) \ge \gamma \quad \text{for all } t \in [-2B/\sigma, 2B/\sigma]. \tag{1}$$

Here the known parameter $B$ denotes a bound on the $\ell_\infty$-norm of the weight vector, namely 

$$\|w^*\|_\infty \le B.$$
As our analysis shows, a bound of this form is fundamental: the minimax error for estimating $w^*$ 
will diverge to infinity if we are allowed to consider models in which $B$ is arbitrarily large (see 
Proposition 17 in Appendix G). Informally, this behavior is related to the difficulty of estimating 
very small (or very large) probabilities that can arise in the two models for large $\|w^*\|_\infty$. Note that 
any model of the form (ORDINAL) is invariant to shifts in $w^*$; that is, it does not differentiate between 
the vector $w^*$ and the shifted vector $w^* + \mathbf{1}$, where $\mathbf{1}$ denotes the vector of all ones. Therefore, in order 
to ensure identifiability of $w^*$, we assume throughout that $\langle \mathbf{1}, w^* \rangle = 0$. We will use the notation $\mathcal{W}_B$ 
to denote the set of permissible quality score vectors 

$$\mathcal{W}_B := \{ w \in \mathbb{R}^d \mid \|w\|_\infty \le B \text{ and } \langle \mathbf{1}, w \rangle = 0 \}. \tag{2}$$

Both the Thurstone (Case V) model with Gaussian noise (Thurstone, 1927) and the Bradley-Terry- 
Luce (BTL) model (Bradley and Terry, 1952; Luce, 1959) are special cases of this general set-up, 
as we now describe. 


Thurstone (Case V): This model is a special case of the family (Ordinal), obtained by setting 

$$F(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-u^2/2} \, du, \tag{3}$$

corresponding to the CDF of the standard normal distribution. Consequently, the Thurstone model 
can alternatively be written as making $n$ i.i.d. observations of the form 

$$y_i = \mathrm{sign}\big\{ \langle x_i, w^* \rangle + \epsilon_i \big\} \quad \text{for } i \in [n], \tag{Thurstone}$$

where $\epsilon_i \sim N(0, \sigma^2)$ is observation noise. It can be verified that the THURSTONE model is strongly 
log-concave (e.g., see Tsukida and Gupta, 2011). 


Bradley-Terry-Luce: The Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952; Luce, 
1959) is another special case in which 

$$F(t) = \frac{1}{1 + e^{-t}},$$

and hence 

$$\mathbb{P}[y_i = 1 \mid x_i, w^*] = \frac{1}{1 + \exp\big(-\langle x_i, w^* \rangle / \sigma\big)} \quad \text{for } i \in [n]. \tag{BTL}$$

It can also be verified that the BTL model is strongly log-concave. 
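To make the ORDINAL model concrete, the following minimal Python sketch draws comparison outcomes for the BTL and Thurstone choices of $F$. The function names, the toy weight vector and the list of pairs are illustrative assumptions and do not come from the paper.

```python
import math
import random


def btl_cdf(t):
    """BTL choice of F: the logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def thurstone_cdf(t):
    """Thurstone (Case V) choice of F: the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def btl_curvature(t):
    """-(d^2/dt^2) log F(t) for the logistic F, which equals F(t) * (1 - F(t))."""
    return btl_cdf(t) * (1.0 - btl_cdf(t))


def sample_comparisons(w_star, pairs, F, sigma=1.0, seed=0):
    """Draw one outcome y_i in {-1, +1} for every pair (a_i, b_i).

    y_i = +1 (item a_i beats item b_i) with probability
    F((w*[a_i] - w*[b_i]) / sigma), i.e. F(<x_i, w*> / sigma).
    """
    rng = random.Random(seed)
    outcomes = []
    for a, b in pairs:
        p_win = F((w_star[a] - w_star[b]) / sigma)
        outcomes.append(1 if rng.random() < p_win else -1)
    return outcomes


# Hypothetical example: three items, centred so that <1, w*> = 0.
w_star = [0.5, 0.0, -0.5]
pairs = [(0, 1), (1, 2), (0, 2)] * 1000
y = sample_comparisons(w_star, pairs, btl_cdf)

# Empirical frequency with which item 0 beats item 2 (1000 such comparisons).
wins_02 = sum(1 for (a, b), yi in zip(pairs, y) if (a, b) == (0, 2) and yi == 1) / 1000
```

For the logistic $F$ the curvature $F(t)(1 - F(t))$ is smallest at the ends of the interval in (1), so one admissible value in this sketch is $\gamma = F(2B/\sigma)(1 - F(2B/\sigma))$.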


Cardinal observation models: While our primary focus is on the pairwise-comparison setting, 
for comparison purposes we also analyze analogous cardinal settings where each observation is real 
valued. In particular, we consider the following two cardinal analogues of the Thurstone model. 
In the Cardinal model we consider, each observation $i \in [n]$ consists of a numeric evaluation $y_i \in \mathbb{R}$ 
of a single item, 

$$y_i = \langle u_i, w^* \rangle + \epsilon_i \quad \text{for } i \in [n], \tag{Cardinal}$$

where $u_i$ in this case is a coordinate vector with one of its entries equal to 1 and remaining entries 
equal to 0, and $\epsilon_i$ is independent Gaussian noise $N(0, \sigma^2)$. One may alternatively elicit cardinal 
values of the differences between pairs of items 

$$y_i = \langle x_i, w^* \rangle + \epsilon_i \quad \text{for } i \in [n], \tag{Paired Cardinal}$$

where the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$. We term this model the Paired Cardinal model. 



2.2 Fixed design and the graph Laplacian 

We analyze the estimation error when a fixed subset of pairs is chosen for comparison. Of interest 
to us will be the comparison graph defined by these chosen pairs, with each pair inducing an edge 
in the graph. Edge weights are determined by the fraction of times a given pair is compared. The 
analysis in the sequel reveals the central role played by the Laplacian of this weighted graph. Note 
that we are operating in a fixed-design setup where the graph is constructed offline and does not 
depend on the observations. 

In the ordinal models, the $i$-th measurement is related to the difference between the two items 
being compared, as defined by the measurement vector $x_i \in \mathbb{R}^d$. We let $X \in \mathbb{R}^{n \times d}$ denote the 
measurement matrix with the vector $x_i^T$ as its $i$-th row. The Laplacian matrix $L$ associated with this 
differencing matrix is given by 

$$L := \frac{1}{n} X^T X = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T. \tag{4}$$

By construction, for any vector $v \in \mathbb{R}^d$, we have $v^T L v = \sum_{j < k} L_{jk} (v_j - v_k)^2$, where $L_{jk}$ is the fraction 
of the measurement vectors in which items $(j, k)$ are compared. 
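The construction above can be checked numerically. The sketch below builds the scaled Laplacian $\frac{1}{n}\sum_i x_i x_i^T$ from a list of compared pairs and compares the quadratic form $v^T L v$ with the average of the squared differences $(v_{a_i} - v_{b_i})^2$; the helper names and the small design are hypothetical.

```python
def scaled_laplacian(pairs, d):
    """Return L = (1/n) * sum_i x_i x_i^T as a d x d list of lists."""
    n = len(pairs)
    L = [[0.0] * d for _ in range(d)]
    for a, b in pairs:
        # x_i x_i^T has +1 at (a, a) and (b, b), and -1 at (a, b) and (b, a).
        L[a][a] += 1.0 / n
        L[b][b] += 1.0 / n
        L[a][b] -= 1.0 / n
        L[b][a] -= 1.0 / n
    return L


def quadratic_form(L, v):
    """Compute v^T L v."""
    d = len(v)
    return sum(v[j] * L[j][k] * v[k] for j in range(d) for k in range(d))


# Hypothetical design: n = 4 comparisons among d = 4 items, pair (0, 1) used twice.
pairs = [(0, 1), (0, 1), (1, 2), (2, 3)]
L = scaled_laplacian(pairs, 4)
v = [1.0, -2.0, 0.5, 3.0]

# Direct evaluation: average squared difference over the comparisons.
direct = sum((v[a] - v[b]) ** 2 for a, b in pairs) / len(pairs)
via_laplacian = quadratic_form(L, v)
```

In this sketch the two quantities agree, the all-ones vector has zero quadratic form, and the trace of `L` equals 2 regardless of the design.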

The Laplacian matrix is positive semidefinite, and has at least one zero-eigenvalue, corresponding 
to the all-ones eigenvector. The Laplacian matrix induces a graph on the vertex set $\{1, \ldots, d\}$, in 
which a given pair $(j, k)$ is included as an edge if and only if $L_{jk} \neq 0$, and the weight on an edge 
$(j, k)$ equals $L_{jk}$. We emphasize that throughout our analysis, we assume that the comparison graph 
is connected, since otherwise, the quality score vector $w^*$ is not identifiable. Note that the Laplacian 
matrix $L$ induces a semi-norm$^1$ on $\mathbb{R}^d$, given by 

$$\|u - v\|_L := \sqrt{(u - v)^T L (u - v)}. \tag{5}$$

We study optimal rates of estimation in this semi-norm, as well as the usual $\ell_2$-norm. As will become 
clearer in the sequel, the $L$ semi-norm is a natural metric in our setup, and estimation in this induced 
metric can be done at a topology-independent rate. The estimation error in the $L$ semi-norm is 
closely related to the prediction risk in generalized linear models. It arises naturally when one is 
interested in predicting the probability of a certain outcome for a new comparison. 


3. Bounds on the minimax risk 

In this section, we state the main results of the paper, and discuss some of their consequences. 


3.1 Minimax rates in the squared L semi-norm 

Our first main result provides bounds on the minimax risk under the squared L semi-norm (5) in 
the pairwise comparison models introduced earlier. In all of the statements, we use $c_1, c_2$, etc. to 
denote positive numerical constants, independent of the sample size $n$, number of items $d$ and other 
problem-dependent parameters. 

Apart from the parameter $\gamma$, the bounds presented subsequently will depend on $F$ through a 
second parameter defined as 

$$\zeta := \frac{\max_{x \in [0, 2B/\sigma]} F'(x)}{F(2B/\sigma)\big(1 - F(2B/\sigma)\big)}. \tag{6}$$

In the BTL and the Thurstone models, $\zeta$ is obtained by substituting the corresponding choice of $F$ into this definition. 


1. A semi-norm differs from a norm in that the semi-norm of a non-zero element is allowed to be zero. 
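Theorem 1 (Bounds on minimax rates in L semi-norm) (a) For a sufficiently large sample size $n$, 
any estimator $\hat{w}$ based on $n$ samples from the ORDINAL model has Laplacian squared error lower 
bounded as 

$$\sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[ \|\hat{w} - w^*\|_L^2 \big] \ge c_{1\ell} \, \frac{\sigma^2}{\zeta} \, \frac{d}{n}. \tag{7a}$$

(b) For any instance of the ORDINAL model with $\gamma$-strong log-concavity and any $w^* \in \mathcal{W}_B$, the 
maximum likelihood estimator $\hat{w}_{ML}$ satisfies the bound 

$$\mathbb{P}\Big[ \|\hat{w}_{ML} - w^*\|_L^2 > c \, t \, \frac{\zeta \sigma^2}{\gamma} \, \frac{d}{n} \Big] \le e^{-t} \quad \text{for all } t \ge 1,$$

and consequently 

$$\sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[ \|\hat{w}_{ML} - w^*\|_L^2 \big] \le c_{1u} \, \frac{\zeta \sigma^2}{\gamma} \, \frac{d}{n}. \tag{7b}$$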

The results of Theorem 1 characterize the minimax risk in the squared $L$ semi-norm up to 
constant factors. The upper bounds follow from an analysis of the maximum likelihood estimator, 
which turns out to be a convex optimization problem. On the other hand, the lower bounds are 
based on a combination of information-theoretic techniques and carefully constructed packings of 
the parameter set $\mathcal{W}_B$. The main technical difficulty is in constructing a packing in the semi-norm 
induced by the Laplacian $L$. See Appendix A for the full proof. 
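Because $F$ is log-concave, the negative log-likelihood of the ORDINAL model is convex in $w$, so simple first-order methods apply. The sketch below is a minimal illustration for the BTL choice of $F$ using plain gradient descent. The function name `btl_mle`, the step size, the iteration count and the toy data are assumptions made for illustration, and the box constraint $\|w\|_\infty \le B$ is not enforced here.

```python
import math
import random


def logistic(t):
    """BTL choice of F."""
    return 1.0 / (1.0 + math.exp(-t))


def btl_mle(pairs, outcomes, d, sigma=1.0, step=1.0, iters=300):
    """Gradient descent on the average negative log-likelihood
    -(1/n) * sum_i log F(y_i * (w[a_i] - w[b_i]) / sigma) with logistic F.

    Starting from w = 0 keeps <1, w> = 0, since every gradient step adds
    a multiple of a differencing vector x_i.
    """
    n = len(pairs)
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for (a, b), y in zip(pairs, outcomes):
            t = y * (w[a] - w[b]) / sigma
            # d/dt of -log F(t) is -(1 - F(t)) for the logistic function.
            g = -y * (1.0 - logistic(t)) / (sigma * n)
            grad[a] += g
            grad[b] -= g
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical data: three items compared on the complete graph, sigma = 1.
rng = random.Random(0)
w_star = [0.5, 0.0, -0.5]
pairs = [(0, 1), (1, 2), (0, 2)] * 1000
outcomes = [1 if rng.random() < logistic(w_star[a] - w_star[b]) else -1
            for a, b in pairs]
w_hat = btl_mle(pairs, outcomes, d=3)
```

A constrained solver would additionally project onto $\mathcal{W}_B$; that step is omitted because the toy weights lie well inside the box.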


3.2 Minimax rates in the squared $\ell_2$-norm 

Let us now turn to optimizing the minimax risk under the squared Euclidean norm. Theorem 2 
below presents upper and lower bounds on this quantity. 


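Theorem 2 (Bounds on minimax rates in $\ell_2$-norm) (a) For a sufficiently large sample size $n$, any 
estimator $\hat{w}$ based on $n$ samples from the ORDINAL model has squared Euclidean error lower 
bounded as 

$$\sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[ \|\hat{w} - w^*\|_2^2 \big] \ge c_{2\ell} \, \frac{\sigma^2}{\zeta n} \, \max\Big\{ d^2, \; \max_{d' \in \{2, \ldots, d\}} \sum_{i = \lceil 0.99 d' \rceil}^{d'} \frac{1}{\lambda_i(L)} \Big\}. \tag{8a}$$

(b) For any instance of the ORDINAL model with $\gamma$-strong log-concavity and any $w^* \in \mathcal{W}_B$, the 
maximum likelihood estimator $\hat{w}_{ML}$ satisfies the bound 

$$\sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[ \|\hat{w}_{ML} - w^*\|_2^2 \big] \le c_{2u} \, \frac{\zeta \sigma^2}{\gamma} \, \frac{d}{\lambda_2(L) \, n}. \tag{8b}$$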


See Appendix B for the proof of this theorem. As we describe in the next section, the upper and 
lower bounds on the minimax risk from Theorem 2 can be used to identify the comparison graph(s) that lead to the 
best possible minimax risk over all possible graph topologies. 

Figure 2 depicts results from simulations under the THURSTONE model, showing the squared $\ell_2$ 
error of the maximum likelihood estimator for various values of $n$ and $d$. In the simulations, the true 
vector $w^*$ is generated by first drawing a $d$-length vector uniformly at random from $[-1, 1]^d$, followed 
by a scale and shift to ensure $w^* \in \mathcal{W}_B$. The $n$ pairs are chosen uniformly at random (with replacement) 
from the set of $\binom{d}{2}$ possible pairs of items. The values of $\sigma$ and $B$ are both fixed to 
be 1. Given the $n$ samples, inference is performed via the maximum likelihood estimator for the 
THURSTONE model. Each point in the plots is an average of 20 such trials. 

Figure 2. Simulation results under the THURSTONE model. The comparison topology chosen here is 
the complete graph. Both panels plot the error against the number of samples $n$ (from $2^7$ to $2^{13}$) 
for $d = 8, 16, 32, 64$: (a) Error; (b) Rescaled error. 

The error in Figure 2 decays in inverse proportion to $n$, exactly as predicted by Theorem 2. For the 
complete graph, $\lambda_2(L) = \frac{2}{d-1}$, and Theorem 2 thus predicts a quadratic increase in the error with $d$. As 
predicted, the error when normalized by $d^2$ in Figure 2 converges to the same curve for all values of 
$d$. 

Before concluding this section, we also look at the Paired Cardinal model (Section 2.1), the 
cardinal analogue of the THURSTONE model. 


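Theorem 3 (Bounds on minimax rates in $\ell_2$-norm) For the PAIRED CARDINAL model, the 
minimax risk is sandwiched as 

$$c_{\ell} \, \sigma^2 \, \frac{\mathrm{tr}(L^\dagger)}{n} \le \inf_{\hat{w}} \sup_{w^* \in \mathcal{W}_\infty} \mathbb{E}\big[ \|\hat{w} - w^*\|_2^2 \big] \le c_{u} \, \sigma^2 \, \frac{\mathrm{tr}(L^\dagger)}{n}, \tag{9}$$

where $L^\dagger$ denotes the Moore-Penrose pseudo-inverse of $L$. 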

The proof of Theorem 3 is available in Appendix C. 

We conjecture that the dependence of the squared $\ell_2$ minimax risk under the ORDINAL models 
on the problem parameters $n$, $d$ and the graph topology is identical to that derived in Theorem 3 for 
the Paired Cardinal model, i.e., is proportional to $\frac{\mathrm{tr}(L^\dagger)}{n}$. 
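As a worked instance of the rate $\sigma^2 \, \mathrm{tr}(L^\dagger)/n$, consider the illustrative case in which the $n$ comparisons are spread evenly over all pairs of items (the complete graph). The scaled Laplacian then has eigenvalue $0$ once and eigenvalue $\frac{2}{d-1}$ with multiplicity $d-1$, so its pseudo-inverse has eigenvalue $\frac{d-1}{2}$ with multiplicity $d-1$:

```latex
\begin{align*}
\mathrm{tr}(L^\dagger)
  &= \sum_{i=2}^{d} \frac{1}{\lambda_i(L)}
   = (d-1) \cdot \frac{d-1}{2}
   = \frac{(d-1)^2}{2}, \\
\sigma^2 \, \frac{\mathrm{tr}(L^\dagger)}{n}
  &= \frac{\sigma^2 (d-1)^2}{2n}.
\end{align*}
```

In this special case the Paired Cardinal risk therefore grows quadratically in $d$ and decays as $1/n$.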


3.3 Extension to m-ary comparisons 

Suppose that, instead of eliciting pairwise comparisons, one can ask the workers to make 
comparisons between more than two options. In particular, we assume that each sample is a selection of 
the item with the largest perceived quality among some $m$ presented items. The setting of pairwise 
comparisons is a special case with $m = 2$. Recall from Theorem 2 that the minimum squared $\ell_2$ 
minimax risk in the pairwise comparison setting is of the order $\frac{d^2}{n}$. Our goal in this section is to 
bring the concept of multiple-item comparison under the same framework as the pairwise case, and 
via a generalization of our earlier theoretical analysis, understand how the error exponent depends 
on $m$. 
Consider $d$ items, where every item $j \in [d]$ has a certain underlying quality score $w^*_j \in [-B, B]$. 
We obtain $n$ samples, with each sample being a selection of the item with the largest perceived value 
among some $m$ presented items. 

Consider $(d \times m)$ matrices $E_1, \ldots, E_n$ such that for each $i \in [n]$, the $m$ columns of $E_i$ are distinct 
unit vectors. The positions of the non-zero elements in the $m$ columns of $E_i$ represent the identities 
of the $m$ items compared in the $i$-th sample. One can visualize the choices of the items compared as a 
hyper-graph, with $d$ vertices representing the $d$ items and hyper-edge $i \in [n]$ containing the $m$ items 
compared in observation $i$. 

Let $R_1, \ldots, R_m$ be $(m \times m)$ permutation matrices representing $m$ cyclic shifts in an arbitrary 
(but fixed) direction. Consider the observation model 

$$\mathbb{P}(y_i = j \mid w^*, E_i) = F\big( (w^*)^T E_i R_j \big)$$

for all $j \in [m]$, where $F : [-B, B]^m \to [0, 1]$ represents the probability of choosing the first among 
the $m$ items presented. For every $x \in [-B, B]^m$, $F(x)$ is assumed to satisfy: 

• Shift-invariance: the probabilities depend only on the differences in the weights of the items 
presented, i.e., $F(x)$ depends only on $\{x_j - x_k\}_{j, k \in [m]}$. 

• Strong log-concavity: $\nabla^2(-\log F(x)) \succeq H$ for some $(m \times m)$ symmetric matrix $H$ with $\lambda_2(H) > 
0$. 

Note that the shift-invariance assumption implies $\mathbf{1} \in \mathrm{nullspace}(\nabla^2(-\log F(x)))$, thereby 
necessitating $\mathrm{nullspace}(H) = \mathrm{span}(\mathbf{1})$ and $\lambda_1(H) = 0$. One can also verify that the model proposed here 
reduces to the ORDINAL model of Section 2.1 when $m = 2$. 

For any hope of inferring the true weights $w^*$, we must ensure that the comparison hyper-graph 
is "connected", i.e., for every pair of items $i, j \in [d]$, there must exist a path connecting item $i$ and 
item $j$ in the comparison hyper-graph. We assume this condition is satisfied. We also continue to 
assume that $w^* \in \mathcal{W}_B := \{ w \in \mathbb{R}^d \mid \|w\|_\infty \le B, \; \langle w, \mathbf{1} \rangle = 0 \}$. 

The popular Plackett-Luce model falls in this class, as illustrated below. 


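Example 1 (Plackett-Luce model (Plackett, 1975; Luce, 1959)) The Plackett-Luce model 
concerns the process of choosing an item from a given set. Specifically, given $m$ items with quality scores 
$w^*_1, \ldots, w^*_m$ respectively, the likelihood of choosing item $i \in [m]$ under this model is given by 

$$\mathbb{P}(\text{item } i \text{ is chosen}) = \frac{e^{w^*_i}}{\sum_{j=1}^{m} e^{w^*_j}},$$

which corresponds to the choice $F(x) = e^{x_1} / \sum_{j=1}^{m} e^{x_j}$ for the probability of selecting the first of the 
presented items. Every choice is made independently of all other choices. 

It is easy to verify that the Plackett-Luce model satisfies shift invariance. We now show that it 
also satisfies strong log-concavity. A little algebra gives 

$$\nabla^2(-\log F(x)) = \frac{1}{\langle e^x, \mathbf{1} \rangle^2} \Big( \langle e^x, \mathbf{1} \rangle \, \mathrm{diag}(e^x) - e^x (e^x)^T \Big),$$

where $e^x := [e^{x_1} \cdots e^{x_m}]^T$. We will now derive a lower bound for the expression above. An application 
of the Cauchy-Schwarz inequality yields that for any vector $v \in \mathbb{R}^m$, 

$$v^T \big( e^x (e^x)^T \big) v \le v^T \mathrm{diag}(e^x) \langle e^x, \mathbf{1} \rangle v,$$

with equality if and only if $v \in \mathrm{span}(\mathbf{1})$. It follows that $\lambda_2(\nabla^2(-\log F(x))) > 0$ for all $x \in [-B, B]^m$. 
Defining the scalar $\beta := \min_{x \in [-B, B]^m} \lambda_2\Big( \frac{1}{\langle e^x, \mathbf{1} \rangle^2} \big( \langle e^x, \mathbf{1} \rangle \mathrm{diag}(e^x) - e^x (e^x)^T \big) \Big)$, one can see that setting 
$H = \beta \big( I - \frac{1}{m} \mathbf{1}\mathbf{1}^T \big)$ satisfies the strong log-concavity conditions. 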
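The two properties used in this example can be checked numerically. The sketch below evaluates the Hessian of $-\log F$ for the softmax form $F(x) = e^{x_1}/\sum_j e^{x_j}$ at a random point and tests that the all-ones vector lies in its nullspace while a vector orthogonal to it has strictly positive curvature. The helper names, the choice $m = 4$, $B = 1$ and the test vector are assumptions for illustration.

```python
import math
import random


def pl_hessian(x):
    """Hessian of -log F(x) for F(x) = exp(x_1) / sum_j exp(x_j).

    Equals (s * diag(e) - e e^T) / s^2, with e = exp(x) and s = sum(e).
    """
    e = [math.exp(xi) for xi in x]
    s = sum(e)
    m = len(x)
    return [[((s * e[j] if j == k else 0.0) - e[j] * e[k]) / s ** 2
             for k in range(m)] for j in range(m)]


def quad(H, v):
    """Compute v^T H v."""
    m = len(v)
    return sum(v[j] * H[j][k] * v[k] for j in range(m) for k in range(m))


rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(4)]   # a point in [-B, B]^m
H = pl_hessian(x)

ones_image = [sum(row) for row in H]   # H applied to the all-ones vector
v = [1.0, -1.0, 2.0, -2.0]             # orthogonal to the all-ones vector
curvature = quad(H, v)                 # strictly positive by Cauchy-Schwarz
```

The quadratic form `quad(H, v)` is the variance of the entries of `v` under the softmax distribution of `x`, which is another way to see why it vanishes only on the span of the all-ones vector.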
Our goal is to capture the scaling of the minimax error with respect to the number of observations 
$n$, the dimension $d$ of the problem, and the choice of the subsets compared $\{E_i\}_{i \in [n]}$. It is well 
understood (Miller, 1956; Kiger, 1984; Shiffrin and Nosofsky, 1994; Saaty and Ozdemir, 2003) that 
humans have a limited information storage and processing capacity, which makes it difficult to 
compare more than a small number of items. For instance, Saaty and Ozdemir (2003) recommend 
eliciting preferences over no more than seven options. Thus in this work we will restrict our attention 
to $m = O(1)$. Moreover, the amount of noise in the selection process also depends on the number of 
items $m$ presented at a time: the higher the number, the greater the noise. We will thus not use a 
"noise parameter $\sigma$" in this setting, and assume the noise to be incorporated in the function $F$, which 
itself is a function of $m$. 

Our results involve the Laplacian of the comparison graph, defined for the $m$-wise comparison 
setting as follows. Let $L$ be a $(d \times d)$ matrix that depends on the choice of the comparison topology 
as 

$$L := \frac{1}{n} \sum_{i=1}^{n} E_i \big( m I - \mathbf{1}\mathbf{1}^T \big) E_i^T. \tag{10}$$

We will call $L$ the Laplacian of the comparison hyper-graph. One can verify that when applied to 
the special case of $m = 2$, the matrix $L$ defined in (10) reduces to the Laplacian of the pairwise- 
comparison graph defined earlier in (4). 
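The reduction to the pairwise case can be written out in one line. The derivation below assumes that the hyper-graph Laplacian has the form $L = \frac{1}{n}\sum_i E_i (mI - \mathbf{1}\mathbf{1}^T) E_i^T$ and that, for $m = 2$, the two columns of $E_i$ are the coordinate vectors $e_{a_i}$ and $e_{b_i}$ of the two items compared:

```latex
\begin{align*}
E_i \big( 2I - \mathbf{1}\mathbf{1}^T \big) E_i^T
  &= \begin{bmatrix} e_{a_i} & e_{b_i} \end{bmatrix}
     \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}
     \begin{bmatrix} e_{a_i}^T \\ e_{b_i}^T \end{bmatrix} \\
  &= (e_{a_i} - e_{b_i})(e_{a_i} - e_{b_i})^T
   = x_i x_i^T .
\end{align*}
```

Averaging over $i \in [n]$ then gives $\frac{1}{n}\sum_i x_i x_i^T$, the pairwise Laplacian.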

The following theorem presents our main results for the m-wise comparison setting. 

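Theorem 4 For the $m$-WISE model, the minimax risk is sandwiched as 

$$c_{3\ell} \, \frac{\inf_z F(z)}{m^2 \lambda_m(H) \sup_z \|\nabla F(z)\|_{H^\dagger}^2} \, \frac{d}{n} \le \inf_{\hat{w}} \sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[ \|\hat{w} - w^*\|_L^2 \big] \le c_{3u} \, \frac{m \sup_z \|\nabla \log F(z)\|_2^2}{\lambda_2(H)^2} \, \frac{d}{n}$$

in the squared $L$ semi-norm and as 

$$c_{4\ell} \, \frac{\inf_z F(z)}{m^2 \lambda_m(H) \sup_z \|\nabla F(z)\|_{H^\dagger}^2} \, \frac{d^2}{n} \le \inf_{\hat{w}} \sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[ \|\hat{w} - w^*\|_2^2 \big] \le c_{4u} \, \frac{m^2 \sup_z \|\nabla \log F(z)\|_2^2}{\lambda_2(H)^2} \, \frac{d^2}{n}$$

in the squared $\ell_2$ norm. Here we assume that the sample size $n$ is sufficiently large for both the lower bounds, and 
the suprema and infima with respect to the parameter $z$ are taken over the set $[-B, B]^m$. 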


The proof of Theorem 4 is provided in Appendix D. Our results establish that the dependence of 
the squared $L$ semi-norm and squared Euclidean minimax error on $m$ occurs only as multiplicative 
pre-factors, and the error exponent is independent of $m$. Thus, if one follows the standard 
recommendation in the psychology literature (Miller, 1956; Kiger, 1984; Shiffrin and Nosofsky, 1994; Saaty 
and Ozdemir, 2003), namely to choose $m = O(1)$, then the best possible scaling of the squared 
$L$ semi-norm minimax risk with respect to $d$ and $n$ is always $\frac{d}{n}$, that of the squared Euclidean 
minimax risk is always $\frac{d^2}{n}$, and evenly spreading the samples across all possible choices of $m$ items is 
optimal. Nevertheless, a more refined modeling and analysis is required to understand the precise 
tradeoffs governing the choice of the number $m$ of items presented to the user. 


4. Role of graph topology 

We now return to the setting of pairwise comparisons. In certain applications, one may have the 
liberty to decide which pairs are compared. The results of the previous section demonstrated the role 
played by the Laplacian of the comparison graph in the estimation error. We now employ these results 
to derive guidelines towards designing the comparison graph. Let us focus on the estimation error in 
the squared $\ell_2$ norm in the ordinal setting. As discussed earlier, we assume that the graph induced 
by the comparisons is connected. An application of Theorem 2 lets us identify good topologies for 
pairwise comparisons in the fixed-design setup. 

A popular class of comparison topologies is that of evenly distributed samples on an unweighted 
graph (e.g., Negahban et al., 2012). Consider any fixed, unweighted graph $G = (V, E)$. We assume 
that the samples are distributed evenly along the edges $E$ of $G$, and that the sample size $n$ is 
sufficiently large. Using standard matrix concentration inequalities, it is straightforward to extend 
our analysis to the setting of randomly chosen comparisons from a fixed graph (see, for instance, 
Oliveira (2009)). Let $\widetilde{L}$ denote the Laplacian of $G$. We define the scaled Laplacian of $G$ as 

$$L := \frac{1}{|E|} \widetilde{L}.$$

One can verify that the matrix $L$ defined here is identical to what was defined in (4) in a more general 
context. In order to differentiate from $L$, we will term $\widetilde{L}$ the regular Laplacian of the graph $G$. 


4.1 Analytical results 

Consider the ORDINAL model and the squared f 2 -norm as the metric of interest. We claim that 
in order to determine whether a given comparison graph achieves minimax risk (up to a constant 
pre-factor), it suffices to examine the eigen-spectrum of the scaled Laplacian matrix. In particular, 
we claim that: 


If the scaled Laplacian has a second smallest eigenvalue that scales as $\frac{1}{\lambda_2(L)} = \Theta(d)$, then the
comparison graph is optimal, and leads to the smallest possible minimax risk, in particular one
that scales as $\frac{d^2}{n}$.

Conversely, if the scaled Laplacian matrix has an eigen-spectrum satisfying

$$d^2 = o\left( \max_{d' \in \{2, \ldots, d\}} \frac{d'}{\lambda_{d'}(L)} \right), \qquad (11)$$

then the associated estimation error is strictly larger than the minimax risk. In particular, this
sub-optimality holds whenever $d^2 = o\left( \frac{1}{\lambda_2(L)} \right)$.

In order to verify these claims, we note that by definition (4) of the Laplacian matrix, we have

$$\mathrm{tr}(L) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{tr}(x_i x_i^T) = 2.$$

It follows that $\lambda_2(L) \leq \frac{\mathrm{tr}(L)}{d-1} = \frac{2}{d-1}$, i.e., that $\frac{1}{\lambda_2(L)} = \Omega(d)$. As we will see shortly, several classes of graphs
satisfy $\frac{1}{\lambda_2(L)} = \Theta(d)$. Comparing the lower bound of $\Omega(\frac{d^2}{n})$ on the minimax risk (8a) with the upper
bound (8b) gives the sufficient condition of $\frac{1}{\lambda_2(L)} = \Theta(d)$ for optimality, and the smallest minimax
risk as $\Theta(\frac{d^2}{n})$. The lower bound (8a) now also gives the claimed condition for strict sub-optimality.

In order to illustrate these claims, let us consider a few canonical classes of graphs, and study how 
the estimation error under the squared Euclidean norm scales in the ORDINAL model. The spectra 
of the regular Laplacian matrices of these graphs can be found in various standard texts on spectral 
graph theory (e.g., Brouwer and Haemers (2011)).

• Complete graph. A complete graph has one edge between every pair of nodes. The spectrum
of the regular Laplacian of the complete graph is $0, d, \ldots, d$, and hence the spectrum of the scaled
Laplacian $L$ is $0, \frac{2}{d-1}, \ldots, \frac{2}{d-1}$. Substituting $\lambda_2(L) = \frac{2}{d-1}$ in the upper bound of Theorem 2 gives an upper bound
of $O(\frac{d^2}{n})$ on the minimax risk, and the lower bound of Theorem 2 matches it. The sufficiency
condition discussed above proves optimality.

• Constant-degree expander. The spectrum of the regular Laplacian is $0, \Theta(1), \Omega(1), \ldots, \Omega(1)$.
Since the number of edges is $\Theta(d)$, the spectrum of the scaled Laplacian equals $0, \Theta(\frac{1}{d}), \Omega(\frac{1}{d}), \ldots, \Omega(\frac{1}{d})$.
The evaluation of this class of graphs with respect to the minimax risk is identical to that of
complete graphs, giving a lower and upper bound of $\Theta(\frac{d^2}{n})$ on the minimax risk, and guaranteeing
optimality.

• Complete bipartite. The $d$ nodes are partitioned into two sets comprising, say, $m_1$ and $m_2$
nodes. There is an edge between every pair of nodes in different sets, and there are no edges
between any two nodes in the same set. The eigenvalues of the regular Laplacian of this graph are
$0$, $m_2$ (with multiplicity $m_1 - 1$), $m_1$ (with multiplicity $m_2 - 1$), and $m_1 + m_2$. Since the total
number of edges is $m_1 m_2$, the scaled Laplacian $L$ has a spectrum $0$, $\frac{1}{m_1}$ (with multiplicity $m_1 - 1$),
$\frac{1}{m_2}$ (with multiplicity $m_2 - 1$), and $\frac{1}{m_1} + \frac{1}{m_2}$. Suppose without loss of generality that $m_1 \geq m_2$.
Also suppose that $m_2 > 1$ (the case of $m_2 = 1$ is the star graph discussed below). Then we have
$\frac{1}{m_1} \leq \frac{1}{m_2} < \frac{1}{m_1} + \frac{1}{m_2}$ and $d > m_1 \geq \frac{d}{2}$. Furthermore, since $m_1 \geq m_2 > 1$, the multiplicity of $\frac{1}{m_1}$
in the spectrum of the scaled Laplacian is at least 1. Thus we have $\lambda_2(L) = \frac{1}{m_1} = \Theta(\frac{1}{d})$. Theorem 2
then gives lower and upper bounds on the minimax risk as $\Theta(\frac{d^2}{n})$, and the sufficiency condition
discussed above guarantees its optimality.

• Star. A star graph has one central node with edges to every other node. It is a special case
of the complete bipartite graph with $m_1 = d - 1$ and $m_2 = 1$. The spectrum of the regular
Laplacian is $0, 1, \ldots, 1, d$. Since there are $(d - 1)$ edges, the spectrum of the scaled Laplacian is
$0, \frac{1}{d-1}, \ldots, \frac{1}{d-1}, \frac{d}{d-1}$. Theorem 2 and the sufficiency condition discussed above imply that this class
of graphs is optimal and is associated to a minimax risk of $\Theta(\frac{d^2}{n})$.

• Path. A path graph is associated to an arbitrary ordering of the $d$ nodes with edges between
pairs $j$ and $(j + 1)$ for every $j \in \{1, \ldots, d - 1\}$. The spectrum of the regular Laplacian is given by
$2(1 - \cos(\frac{\pi i}{d}))$, $i \in \{0, \ldots, d - 1\}$, and that of the scaled Laplacian is thus $\frac{2}{d-1}(1 - \cos(\frac{\pi i}{d}))$,
$i \in \{0, \ldots, d - 1\}$. The relation $1 - \cos x = 2\sin^2\frac{x}{2}$ and the approximation $\sin x \approx x$ for values of $x$
close to zero gives $\lambda_2(L) = \Theta(\frac{1}{d^3})$. The minimax risk is thus upper bounded as $O(\frac{d^4}{n})$ and lower
bounded as $\Omega(\frac{d^3}{n})$. This class of graphs is thus strictly suboptimal.

• Cycle. A cycle is identical to a path except for an additional edge between node $d$ and node 1.
The spectrum of the regular Laplacian is given by $2(1 - \cos(\frac{2\pi i}{d}))$, $i \in \{0, \ldots, d - 1\}$, and that of
the scaled Laplacian is thus $\frac{2}{d}(1 - \cos(\frac{2\pi i}{d}))$, $i \in \{0, \ldots, d - 1\}$. The relation $1 - \cos x = 2\sin^2\frac{x}{2}$
and the approximation $\sin x \approx x$ for values of $x$ close to zero gives $\lambda_2(L) = \Theta(\frac{1}{d^3})$. The minimax
risk is thus upper bounded as $O(\frac{d^4}{n})$ and lower bounded as $\Omega(\frac{d^3}{n})$. This class of graphs is thus
strictly suboptimal.

• Barbell. The nodes are partitioned into two sets of $\frac{d}{2}$ nodes each, and there is an edge between
every pair of nodes within each set. In addition, there is exactly one edge across the sets. The
spectrum of the regular Laplacian can be computed as $0, \Theta(\frac{1}{d}), \Theta(d), \ldots, \Theta(d)$. Since there are $\Theta(d^2)$
edges, the spectrum of the scaled Laplacian turns out to become $0, \Theta(\frac{1}{d^3}), \Theta(\frac{1}{d}), \ldots, \Theta(\frac{1}{d})$.
Applying the results derived earlier in the paper, we get a lower bound of $\Omega(\frac{d^3}{n})$ and an upper
bound of $O(\frac{d^4}{n})$ on the minimax risk, thereby also establishing the sub-optimality of this class of
graphs.

• 2D Lattice. An $(m_1 \times m_2)$ lattice has $d = m_1 m_2$ vertices arranged as an $(m_1 \times m_2)$ grid. Assume
$m_1 = \Theta(\sqrt{d})$ and $m_2 = \Theta(\sqrt{d})$. This class of graphs can be written as a Cartesian product of a path
graph of length $m_1$ and a second path graph of length $m_2$. As a result, the spectrum of the scaled
Laplacian is $\frac{2}{2d - m_1 - m_2}(2 - \cos(\frac{\pi i}{m_1}) - \cos(\frac{\pi j}{m_2}))$, $i \in \{0, \ldots, m_1 - 1\}$, $j \in \{0, \ldots, m_2 - 1\}$. Again, using the small angle
approximation of the sinusoid, one can compute an upper bound on the minimax risk as $O(\frac{d^3}{n})$
and a lower bound of $\Omega(\frac{d^2}{n})$. We do not know at this point whether the 2D lattice minimizes the
minimax risk.

• Hypercube. Assume $d = 2^m$ for some integer $m$. Representing each node as a distinct $m$-length
binary vector, an edge exists between the nodes corresponding to any pair of vectors within a
Hamming distance of one. The hypercube is an $m$-fold Cartesian product of a path with two nodes,
and hence the regular Laplacian has an eigenvalue of $2i$ with multiplicity $\binom{m}{i}$, for $i \in \{0, \ldots, m\}$.
The scaled Laplacian has an eigenvalue of $\frac{2i}{m 2^{m-1}}$ with multiplicity $\binom{m}{i}$, for $i \in \{0, \ldots, m\}$. A lower
bound on the minimax risk is $\Omega(\frac{d^2}{n})$ and an upper bound is $O(\frac{d^2 \log d}{n})$. While we do not know if the
hypercube is optimal, our bounds do tell us that any sub-optimality is bounded by at most a
logarithmic factor.
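The optimality criterion $\frac{1}{\lambda_2(L)} = \Theta(d)$ can be checked numerically from the closed-form spectra listed above. The following is a minimal Python sketch for four of the topologies; the function names `lambda2_scaled` and `inv_lambda2_over_d` are illustrative and do not come from the paper. A ratio $\frac{1}{d \lambda_2(L)}$ that stays bounded as $d$ grows indicates a topology meeting the sufficient condition, while a growing ratio signals the path and cycle behaviour.

```python
import math


def lambda2_scaled(topology: str, d: int) -> float:
    """Second-smallest eigenvalue of the scaled Laplacian L = (regular Laplacian) / |E|,
    evaluated from the closed-form regular spectra quoted in the list above."""
    if topology == "complete":
        # regular spectrum 0, d, ..., d with |E| = d(d-1)/2
        return d / (d * (d - 1) / 2)
    if topology == "star":
        # regular spectrum 0, 1, ..., 1, d with |E| = d - 1
        return 1 / (d - 1)
    if topology == "path":
        # regular spectrum 2(1 - cos(pi*i/d)), i = 0..d-1, with |E| = d - 1
        return 2 * (1 - math.cos(math.pi / d)) / (d - 1)
    if topology == "cycle":
        # regular spectrum 2(1 - cos(2*pi*i/d)), i = 0..d-1, with |E| = d
        return 2 * (1 - math.cos(2 * math.pi / d)) / d
    raise ValueError(f"unknown topology: {topology}")


def inv_lambda2_over_d(topology: str, d: int) -> float:
    """Ratio (1 / lambda_2(L)) / d; bounded in d exactly when 1/lambda_2(L) = Theta(d)."""
    return 1.0 / (d * lambda2_scaled(topology, d))


if __name__ == "__main__":
    for name in ("complete", "star", "path", "cycle"):
        print(name, [round(inv_lambda2_over_d(name, d), 2) for d in (10, 100, 1000)])
```

For the path and the cycle the ratio grows roughly like $d^2$, which matches $\lambda_2(L) = \Theta(\frac{1}{d^3})$ obtained from the small-angle approximation.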

Observe that the degree-$k$ expander requires $n \geq kd$ samples while the complete graph requires
$n \geq \binom{d}{2}$ samples, so in practical applications, at least for small sample sizes, we should prefer a
low-degree expander.

Finally, if the conjecture in Section 3.2 were true, namely that the $\ell_2$ minimax risk scales as
$\sigma^2 \mathrm{tr}(L^\dagger)/n$, then the condition $\mathrm{tr}(L^\dagger) = \Theta(d^2)$ would be necessary and sufficient for optimality of
a comparison graph with the scaled Laplacian $L$. Observe that the graphs designated as ‘optimal’
in the discussion above indeed satisfy this condition. On the other hand, the graphs established as
strictly suboptimal have $\mathrm{tr}(L^\dagger) = \Omega(d^3)$.

4.2 Experiments and simulations 

This section evaluates the dependence of the squared $\ell_2$-error on the topology of the comparison
graph. We consider the following five topologies: path, barbell, complete, expander and 2D-lattice.
In order to form an expander graph, we used the Gabber-Galil construction (Gabber and Galil,
1981). For any chosen graph topology, the $n$ difference vectors are selected as one edge each chosen
uniformly at random (with replacement) from the comparison graph. Recall that our theory predicts
that the complete and expander graphs will perform the best, and that the path (line) and barbell
(dumbbell) graphs will fare the worst. Also recall that our theory predicts that the error
$\|\hat{w} - w^*\|_2^2$ scales with $n$ as $1/n$ in the complete and expander topologies.

4.2.1 Experiments on synthetic data 

This section describes simulations using data generated synthetically from the THURSTONE model.
In the simulations, we first generate a quality score vector $w^* \in W_B$ using one of the procedures
described below. Once $w^*$ is chosen, the $n$ pairwise comparisons for any given topology are generated
as follows. An edge is selected uniformly (with replacement) at random from the underlying graph,
and the chosen edge determines the pair of items compared. The outcome of the comparison is
generated as per the THURSTONE model with the chosen $w^*$ as the underlying quality score. Finally,
the maximum likelihood estimator for the THURSTONE model is employed to estimate $w^*$. Every
point in the plots is an average across 40 trials.
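The data-generation step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' simulation code: it assumes the standard Thurstone form in which item $a$ is reported to beat item $b$ with probability $\Phi((w^*_a - w^*_b)/\sigma)$, where $\Phi$ is the standard Gaussian CDF, and the names `sample_comparisons` and `neg_log_likelihood` as well as the example scores are hypothetical.

```python
import math
import random


def std_normal_cdf(x: float) -> float:
    """Standard Gaussian CDF, written with math.erf to stay in the standard library."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_comparisons(w_star, edges, n, sigma=1.0, seed=0):
    """Draw n comparisons; each sample is (a, b, y) with y = +1 if a is reported to beat b."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        a, b = rng.choice(edges)  # edge chosen uniformly at random, with replacement
        p_a_wins = std_normal_cdf((w_star[a] - w_star[b]) / sigma)
        y = 1 if rng.random() < p_a_wins else -1
        samples.append((a, b, y))
    return samples


def neg_log_likelihood(w, samples, sigma=1.0):
    """Negative log-likelihood that a Thurstone maximum likelihood estimator would minimize."""
    return -sum(
        math.log(std_normal_cdf(y * (w[a] - w[b]) / sigma)) for a, b, y in samples
    )


# Usage: a path graph on d = 4 items with hypothetical quality scores.
path_edges = [(0, 1), (1, 2), (2, 3)]
w_true = [0.9, 0.3, -0.3, -0.9]
data = sample_comparisons(w_true, path_edges, n=200, sigma=1.0, seed=1)
```

Minimizing `neg_log_likelihood` over vectors in $W_B$ (for instance with a generic convex solver) would give the maximum likelihood estimate; the optimization step is omitted here to keep the sketch short.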

The following six procedures are employed to generate the true quality score vector $w^*$ in the
six respective subfigures of Figure 3.

(a) Gaussian: $w^*$ is drawn from the standard normal distribution $\mathcal{N}(0, I)$.

(b) Uniform: $w^*$ is drawn uniformly at random from the set $[-1, 1]^d$.

(c) Packing set for the path graph: We first choose a vector $z$ by setting a value of 0 in the
first coordinate, a value — 1 in ^ of the other coordinates chosen uniformly at random, and a 
value 1 in the remaining coordinates. Letting $L = U^T \Lambda U$ denote the eigen-decomposition of
the Laplacian matrix of the path graph, $w^*$ is set as $U^T \Lambda^\dagger z$, where $\Lambda^\dagger$ is the Moore-Penrose
pseudoinverse of $\Lambda$. This generation process mimics a construction used to prove the lower
bound in Theorem 2, and tailors the construction for the path graph.

(d) Packing set for the barbell graph: The procedure is identical to that in (c), except that the 
Laplacian matrix used is that of the barbell graph. 

(e) Packing set for the complete graph: The procedure is identical to that in (c), except that the 
Laplacian matrix used is that of the complete graph. 

(f) Packing set for the star graph: The procedure is identical to that in (c), except that the 
Laplacian matrix used is that of the star graph. 

The vector $w^*$ generated in this procedure is then scaled and shifted to ensure $w^* \in W_B$. The values
of $B$ and $\sigma$ are set as 1.

Figure 3 plots the estimation error under various topologies of the comparison graph. Observe
in the figure that the error is the lowest under the complete and the star graphs, and the highest
under the barbell and the path graphs. In particular, the error consistently varies as $\Theta(d^2/n)$ for the
complete and star graphs; this phenomenon holds even in plots (e) and (f), where the procedure to
choose $w^*$ forms the worst case for the complete and star graphs respectively according to the proof
of Theorem 2. On the other hand, the minimax error varies as $\Omega(d^3/n)$ in the worst case for the
path and the barbell graphs. Finally, observe that in the simulations, the (constant) multiplicative
factors to the term $d^2/n$ in the error turn out to be rather small, in the range of 0 to 9.

4.2.2 Experiments on MTurk 

In this section, we describe the results of experiments conducted on the popular Amazon Mechanical 
Turk (https://www.mturk.com/; henceforth referred to as “MTurk”) commercial crowdsourcing
platform, evaluating the effects of the choice of the topology. MTurk is an online platform where 
individuals or businesses can put up a task, and any individual can log in and complete the tasks 
in exchange for a payment that is specified along with the task. In our experiments, each worker 
was offered 20 cents per completed task. A worker was allowed to do no more than one task in an 
experiment. Workers were required to answer all the questions in a task. Only those workers who had 
100 or more prior approved works and an approval rate of 95% or higher were allowed. Workers from 
any country were allowed to participate, except for the task of estimating distances between cities 
(for which only USA-based workers were permitted since all questions involved American cities). 

We conducted three experiments that required the workers to make ordinal choices. 

(a) Estimating areas of circles: In each question, the worker was shown a circle in a bounding box
(Figure 5a), and the worker was required to identify the fraction of the box’s area that the
circle occupied.

(b) Estimating age of people from photographs: The worker was shown photographs of people
(Figure 5b) and was asked to estimate their ages.

(c) Estimating distances between pairs of cities: Pairs of cities were listed (Figure 5c) and for each
pair, the worker had to estimate the distance between them.

Figure 3. Estimation error under different topologies for different generative processes in the synthetic
simulations. Panels: (a) Gaussian; (b) Uniform; (c) Packing set for the path graph; (d) Packing set for
the barbell graph; (e) Packing set for the complete graph; (f) Packing set for the star graph. Horizontal
axis: number of items $d$. Vertical axis: rescaled error $\|\hat{w} - w^*\|_2^2 \, n/d^2$.

Figure 4: Estimation error under different topologies in the experiments conducted on MTurk.
Panels: (a) Area of circle; (b) Age from photograph; (c) City distances.

For each experiment, we recruited 140 workers on MTurk, and assigned them to one of the
five topologies uniformly at random. In this experiment and others involving aggregation of
ordinal data from MTurk, the aggregation procedure follows maximum likelihood estimation under the
THURSTONE model, and the estimator is supplied the best-fitting value of $\sigma$ obtained via 3-fold
cross-validation. Each run of the estimation procedure employs the data provided by five randomly
chosen workers from the pool of workers who performed that task. The entire data pertaining to
these experiments is available on the first author’s website.

Figure 4 plots the squared $\ell_2$ estimation error for the three experiments under the five topologies
considered. We see that the relative errors are generally consistent with our theory, with the
complete graph exhibiting the best performance and the path graph faring the worst. On real datasets,
model misspecification can in some cases cause the outcomes to differ from our theoretical
predictions. Understanding the effect of model misspecification, especially on topology considerations, is
an important question we hope to address in future work.


5. Cardinal versus ordinal measurements 

In this section, we compare two approaches towards eliciting data: a score-based “cardinal” approach
and a comparison-based “ordinal” approach. In a cardinal approach, evaluators directly enter
numeric scores as their answers (Figure 1b), while an ordinal approach involves comparing (pairs of)
items (Figure 1a).

There are obvious advantages and disadvantages associated with either approach. On one hand, 
the cardinal approach allows for very fine measurements. For instance, the cardinal measurements in 
Figure 1 can take any value between 0 and 100, whereas an ordinal measurement is binary. One might 
be tempted to go even further and argue that ordinal measurements necessarily give less information, 
for one can always convert a set of cardinal measurements into ordinal, simply by ordering the 
measurements by value. If this conversion were valid, the data processing inequality (Cover and 
Thomas, 2012), would then guarantee that estimators based on ordinal data can never outperform 
estimators based on cardinal data. However, this conversion assumes that cardinal and ordinal 
measurements suffer from the same type of statistical fluctuation. The following set of experiments 
show this assumption is false. 

5.1 Raw data from MTurk 

We conducted seven different experiments on MTurk to investigate the possibility of a
“data-processing inequality” between the elicited cardinal and ordinal responses: Are responses elicited
in ordinal form equivalent to data obtained by first eliciting cardinal responses and then subtracting
pairs of items? Our experiments lead us to conclude that this is generally not the case:
converting cardinally collected data into ordinal (by subtracting pairs of responses) often leads to a higher
amount of noise as compared to that in data that is elicited directly in ordinal form.

Figure 5. Screenshots of the tasks presented to the subjects. For each task, only one version (cardinal
or ordinal) is shown here. Panels: (a) “Which circle is BIGGER?”; (b) “Who do you think is OLDER?”;
(c) “Which pair of cities is farther away from each other?”; (d) “How many words are misspelled in
this paragraph?”; (e) “Which sound has a HIGHER frequency?”; (f) “Rate this tagline for a healthcare
platform”.

The tasks were selected to have a broad coverage of several important subjective judgment
paradigms such as preference elicitation, knowledge elicitation, audio and visual perception and
skill utilization.

In addition to the three experiments described in Section 4.2.2, we conducted the following four 
experiments. 

(d) Finding spelling mistakes in text: The worker had to identify the number of words that were 
misspelled in each paragraph shown (Figure 5d). 

(e) Identifying sounds: The worker was presented with audio clips, each of which was the sound of 
a single key on a piano (which corresponds to a single frequency). The worker had to estimate 
the frequency of the sound in each audio clip (Figure 5e). 

(f) Rating tag-lines for a product: A product was described and tag-lines for this product were
shown (Figure 5f). The worker had to rate each of these tag-lines in terms of its originality,
clarity and relevance to this product.

(g) Rating relevance of the results of a search query: Results for the query ‘Internet’ for an image 
search were shown (Figure 1) and the worker had to rate the relevance of these results with 
respect to the given query. 

Note that the data collected for (a)-(c) here was different and independent of the data collected 
for these tasks in Section 4.2.2. 

Task                Circle   Age     Distance      Spelling   Audio   Tagline   Relevance
Error in Ordinal    6%       13%     [illegible]   40%        20%     44%       31%
Std. dev.           .23      .33     [illegible]   .49        .40     .47       .44
Error in Cardinal   17%      17%     20%           42%        29%     42%       35%
Std. dev.           .31      .38     .38           .46        .43     .46       .44
Time in Ordinal     98s      31s     84s           316s       66s     251s      105s
Std. dev.           21.1     14.3    62.1          33.2       11.1    28.1      13.1
Time in Cardinal    181s     70s     144s          525s       134s    342s      185s
Std. dev.           39.9     33.1    56.2          46.0       12.4    44.6      28.2

Table 1. Comparison of the average amount of error when ordinal data is collected directly versus
when cardinal data is collected and converted to ordinal. Also tabulated is the median time (in seconds)
taken to complete a task by a subject in either type of task.

The number of items $d$ in the experiments ranged from 10 to 25. For each of the seven experiments,
we recruited 100 workers, and assigned each worker to either the ordinal or the cardinal version of
the task at random. Upon obtaining the data, we first reduced the cardinal data obtained from the
experiments into ordinal form by comparing answers given by the subjects to consecutive questions.
For five of the experiments ((a) through (e)), we had access to the “ground truth” solutions, using
which we computed the fraction of answers that were incorrect in the ordinal and the
cardinal-converted-to-ordinal data (any tie in the latter case was counted as half an error). For the two
remaining experiments ((f) and (g)) for which there is no ground truth, we computed the ‘error’
as the fraction of (ordinal or cardinal-converted-to-ordinal) answers provided by the subjects that
disagreed with each other. It is important to note that in the experiments in this section, we did not
run any estimation procedure on the data: we only measured the noise in the raw responses. The
entire data pertaining to these experiments, including the interface seen by the workers and the data
obtained from their work, is available on the first author’s website.
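The error computation for the cardinal-converted-to-ordinal data can be sketched as follows. This is a minimal illustration of the rule stated above (compare answers to consecutive questions, count a tie as half an error), not the authors' analysis script. The function name `converted_ordinal_error` is hypothetical, and skipping pairs whose ground-truth values coincide is an assumption made here because the text does not say how such pairs were handled.

```python
def converted_ordinal_error(answers, truth):
    """Fraction of consecutive-question comparisons that come out wrong after converting
    one worker's cardinal answers to ordinal form; a tie in the answers counts as half
    an error. Pairs with equal ground-truth values are skipped (an assumption)."""
    errors = 0.0
    pairs = 0
    for i in range(len(answers) - 1):
        true_diff = truth[i] - truth[i + 1]
        if true_diff == 0:
            continue  # no well-defined correct ordering for this pair
        pairs += 1
        ans_diff = answers[i] - answers[i + 1]
        if ans_diff == 0:
            errors += 0.5  # tie in the cardinal answers
        elif (ans_diff > 0) != (true_diff > 0):
            errors += 1.0  # the implied ordering contradicts the ground truth
    return errors / pairs if pairs else 0.0


# Usage with made-up numbers: one tie among three comparable pairs.
example_error = converted_ordinal_error([10, 20, 20, 5], [1, 2, 3, 0])
```

Averaging this quantity over workers would give a per-task figure comparable in spirit to the "Error in Cardinal" row of Table 1.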

The results are summarized in Table 1. If the cardinal measurements could always be converted 
to ordinal ones with the same noise level as directly eliciting ordinal responses, then it would be 
unlikely for the amount of error in the ordinal setting to be smaller than that in the cardinal setting. 
Table 1 shows that converting cardinal data to an ordinal form very often results in a higher (and 
sometimes significantly higher) per-sample error in the (raw) responses than direct elicitation of 
ordinal evaluations. Such an outcome may be explained by the argument that the inherent evaluation 
process in humans is not the same in the cardinal and ordinal cases: humans do not perform an 
ordinal evaluation by first performing cardinal evaluations and then comparing them (Barnett, 2003; 
Stewart et al., 2005). One can also see from Table 1 that the amount of time required for cardinal 
evaluations was typically (much) higher than for ordinal evaluations. One can thus assume that we
will typically have the per-observation error in the ordinal case lower than that in the cardinal case.
In particular, if we consider the Thurstone and the Cardinal models (introduced in Section 2.1),
we can assume that $\sigma < \sigma_c$.

5.2 Analytical comparison of Cardinal versus Ordinal 

As discussed earlier, while cardinal measurements allow more flexibility in the range of responses, 
ordinal measurements contain a lower per-sample error. Ordinal measurements have additional 
benefits in that they avoid calibration issues that are frequently encountered in cardinal
measurements (Tsukida and Gupta, 2011), such as the evaluators’ inherent (and possibly time-varying)
biases, or tendencies to give inflated or conservative evaluations. Ordinal measurements are also 
recognized to be easier or faster for humans to make (Barnett, 2003; Stewart et al., 2005), allowing 
for more evaluations with the same amount of time, effort and cost. 

The lack of clarity regarding when to use a cardinal versus an ordinal approach forms the
motivation of this section. Can we make as reliable estimates from paired comparisons as from numeric
scores? How much lower does the noise have to be for comparative measurements to be preferred
over cardinal measurements? The answers to these questions will help in determining how responses
should be elicited.

In order to compare the cardinal and ordinal methods of data elicitation, we focus on a setting
with evenly budgeted measurements. In accordance with the fixed-design setup assumed throughout
the paper, we choose the vectors $x_i$ a priori. Suppose that $n$ is large enough, and that in the ordinal
case we compare each pair $n/\binom{d}{2}$ times. In the cardinal case suppose that we evaluate the quality of
each item $n/d$ times. We consider the Gaussian-noise models Thurstone and Cardinal introduced
earlier in Section 2.1. In order to capture the fact that the amount of noise is different in the cardinal
and ordinal settings, we will denote the standard deviation of the noise in the cardinal setting as $\sigma_c$,
and retain our notation of $\sigma$ for the noise in the ordinal setting. In order to bring the two models
on the same footing, we measure the error in terms of the squared $\ell_2$-norm.

Let $\gamma_G$ and $\zeta_G$ denote the parameters $\gamma$ and $\zeta$ (defined in (1) and (6) respectively) specialized to
the Gaussian distribution. Define b f (< 7, B ) : = y, b u (a, B) : = and b(a, B) : = ^. 

Observe that $b_\ell$, $b_u$ and $b$ are independent of the parameters $n$ and $d$.

With these preliminaries in place, we now compare the minimax error in the estimation under 
the cardinal and ordinal settings. 


Proposition 5 Given a sample size $n$ that is a multiple of $d(d - 1)\,b(\sigma, B)$, suppose that we observe
each coordinate $n/d$ times under the CARDINAL model. Then the minimax risk is given by

$$\inf_{\hat{w}} \sup_{w^* \in W_B} E\left[ \|\hat{w} - w^*\|_2^2 \right] = \sigma_c^2 \frac{d^2}{n}. \qquad (12a)$$

Similarly, if we observe each pair $n/\binom{d}{2}$ times in the THURSTONE model, then the minimax risk is
sandwiched as

$$\sigma^2 b_\ell(\sigma, B) \frac{d^2}{n} \leq \inf_{\hat{w}} \sup_{w^* \in W_B} E\left[ \|\hat{w} - w^*\|_2^2 \right] \leq \sigma^2 b_u(\sigma, B) \frac{d^2}{n}. \qquad (12b)$$

In the cardinal case, when each coordinate is measured the same number of times, the
CARDINAL model reduces to the well-studied normal location model, for which the MLE is known to be the
minimax estimator and its risk is straightforward to characterize (see Lehmann and Casella (1998)
for instance). In the ordinal case, the result follows from the general treatment in Section 3.

Let us now return to the question of deciding between the cardinal and the ordinal methods of data
elicitation. Suppose that we believe the Gaussian-noise models to be reasonably correct, and the
per-observation errors $\sigma$ and $\sigma_c$ under the two settings are known or can be separately measured.
Proposition 5 shows that the scaling of the minimax error in the cardinal and ordinal settings is
identical in terms of the problem parameters $n$ and $d$. As an important consequence, our result
thus allows for the choice to be made based only on the parameters $(\sigma, \sigma_c, B)$, and independent of $n$
and $d$: the ordinal approach incurs a lower minimax error when $b_u(\sigma, B)\sigma^2 < \sigma_c^2$ while the cardinal
approach is better off in terms of minimax error whenever $b_\ell(\sigma, B)\sigma^2 > \sigma_c^2$. Establishing the exact
decision boundary would require tightening the constants in the bounds, a task we leave for future
work.
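The decision rule in the preceding paragraph can be written as a small function. This is a sketch only: the name `preferred_elicitation` is hypothetical, and the constants $b_\ell(\sigma, B)$ and $b_u(\sigma, B)$ are taken as inputs that the caller has already computed from their definitions. The "undetermined" outcome corresponds to the gap between the two bounds, where the text says the constants would need to be tightened.

```python
def preferred_elicitation(sigma: float, sigma_c: float, b_lower: float, b_upper: float) -> str:
    """Compare minimax errors of the ordinal (Thurstone) and cardinal settings.

    sigma   : per-observation noise level in the ordinal setting
    sigma_c : per-observation noise level in the cardinal setting
    b_lower : value of b_l(sigma, B), supplied by the caller
    b_upper : value of b_u(sigma, B), supplied by the caller
    """
    if b_lower > b_upper:
        raise ValueError("expected b_lower <= b_upper")
    if b_upper * sigma ** 2 < sigma_c ** 2:
        return "ordinal"       # ordinal upper bound is below the cardinal risk
    if b_lower * sigma ** 2 > sigma_c ** 2:
        return "cardinal"      # ordinal lower bound is above the cardinal risk
    return "undetermined"      # the bounds do not separate the two settings
```

Note that the outcome does not depend on $n$ or $d$, in line with the observation that both settings share the same scaling in these parameters.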


5.3 Aggregate Estimation Error in Experiments on MTurk 

For the sake of completeness, we also computed the estimation error in the cardinal and ordinal
settings. We consider data from the three experiments (c), (d) and (e).2 We normalize the true
vector to have $\|w^*\|_\infty = 1$ and set $B = 1$. For each of the three experiments, we execute 100
iterations of the following procedure. Select five workers from the cardinal and five from the ordinal
pool of workers uniformly at random. (The number five is inspired by practical systems (Wang et al.,
2011; Piech et al., 2013).) We run the maximum-likelihood estimator of the Cardinal model on the
data from the five workers selected from the cardinal pool, and the maximum-likelihood estimator
of the THURSTONE model on the data from the five workers of the ordinal pool. Note that unlike
Section 5.1, the cardinal data here is not converted to ordinal.

2. We restrict attention to these three experiments for the following reasons. There is no ground truth for experiments
(f) and (g). In experiment (a), the size of each circle in each question is chosen independently from a continuous
distribution, making all questions different and preventing aggregation. Experiment (b) employs a disconnected
topology.


Task                                  Spelling        Distance        Audio
$\ell_2$ error in Ordinal             0.358 ± 0.035   0.168 ± 0.026   0.444 ± 0.055
$\ell_2$ error in Cardinal            0.350 ± 0.045   0.330 ± 0.028   0.508 ± 0.053
Kendall-tau coefficient in Ordinal    0.277 ± 0.049   0.547 ± 0.034   0.513 ± 0.047
Kendall-tau coefficient in Cardinal   0.129 ± 0.046   0.085 ± 0.038   0.304 ± 0.049

Table 2: Evaluation of the inferred solution from the data received from multiple workers.

The results are tabulated in Table 2. To put the results in perspective of the rest of the paper, 
let us also recall the per-sample errors in these experiments from Table 1. Observe that among these 
three experiments, the per-sample noise in the cardinal data was closest to that in the ordinal data 
in the experiment on identifying the number of spelling mistakes. The gap was larger in the two 
remaining experiments. This fact is reflected in the results of Table 2 where the estimator on the
cardinal data incurs a lower $\ell_2$-error than the estimator on the ordinal data in the experiment on
identifying the number of spelling mistakes, whereas the outcome goes the other way in the two
remaining experiments. Our theory needs to tighten the constants in order to address this regime.

6. Conclusions

In this paper, we presented topology-aware minimax error bounds under a broad class of preference- 
elicitation models. We demonstrated the utility of these results in guiding the selection of comparisons 
and in guiding the choice of the elicitation paradigm (cardinal versus ordinal) when these options are 
available. One potential direction for future work would be to investigate improved data collection 
mechanisms, for instance adaptive schemes where we focus our effort on the most noisy comparisons. 
A second direction would be to characterize the precise thresholds for making the choice between the 
cardinal and ordinal approaches. Finally, the Thurstone and BTL models are parametric idealizations 
that have proved useful in a wide variety of applications. In future work we would like to investigate 
more flexible semi-parametric and non-parametric pairwise comparison models (see, for instance, 
Chatterjee (2014); Braverman and Mossel (2008)).

Appendix A. Proof of Theorem 1

The following two sections prove the lower and upper bounds (respectively) on the minimax risk of
the ORDINAL model under the squared $L$ semi-norm.

A.1 Lower bound


Our lower bounds are based on the Fano argument, which is a standard method in minimax analysis
(see for instance Tsybakov (2008)). Suppose that our goal is to bound the minimax risk of estimating
a parameter $w$ over an indexed class of distributions $\mathcal{P} = \{P_w \mid w \in \mathcal{W}\}$ in the square of a
pseudo-metric $\rho$. Consider a collection of vectors $\{w^1, \ldots, w^M\}$ contained within $\mathcal{W}$ such that

$$\min_{j,k \in [M],\, j \neq k} \rho(w^j, w^k) \geq \delta \quad \text{and} \quad \frac{1}{\binom{M}{2}} \sum_{j,k \in [M],\, j \neq k} D_{\mathrm{KL}}(P_{w^j} \| P_{w^k}) \leq \beta.$$

We refer to any such subset as a $(\delta, \beta)$-packing set.


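Lemma 6 (Pairwise Fano minimax lower bound) Suppose that we can construct a $(\delta, \beta)$-packing
with cardinality $M$. Then the minimax risk is lower bounded as

$$\inf_{\hat{w}} \sup_{w^* \in \mathcal{W}} E\left[ \rho(\hat{w}, w^*)^2 \right] \geq \frac{\delta^2}{2} \left( 1 - \frac{\beta + \log 2}{\log M} \right). \qquad (13)$$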


In order to apply Lemma 6, we need to construct a suitable packing set. Given a scalar
$\alpha \in (0, \frac{1}{4})$ whose value will be specified later, define the integer

$$M(\alpha) := \left\lfloor \exp\left\{ \frac{d}{2} \left( \log 2 + 2\alpha \log 2\alpha + (1 - 2\alpha) \log(1 - 2\alpha) \right) \right\} \right\rfloor. \qquad (14)$$


We require the following two auxiliary lemmas: 

Lemma 7 For any $\alpha \in (0, \frac{1}{4})$, there exists a set of $M(\alpha)$ binary vectors $\{z^1, \ldots, z^{M(\alpha)}\} \subset \{0, 1\}^d$
such that

$$\alpha d \leq \|z^j - z^k\|_2^2 \leq d \quad \text{for all } j \neq k \in [M(\alpha)], \text{ and} \qquad (15a)$$

$$\langle e_1, z^j \rangle = 0 \quad \text{for all } j \in [M(\alpha)], \qquad (15b)$$

where $e_1$ denotes the first canonical basis vector.

This result is a straightforward consequence of the Gilbert-Varshamov bound (Gilbert, 1952;
Varshamov, 1957).

Lemma 8 For any pair of quality score vectors $w^j$ and $w^k$, and for

$$\zeta := \frac{\max_{x \in [0, 2B/\sigma]} F'(x)}{F(2B/\sigma)\,(1 - F(2B/\sigma))},$$

we have

$$D_{\mathrm{KL}}(P_{w^j} \| P_{w^k}) \leq \frac{n \zeta}{\sigma^2} (w^j - w^k)^T L (w^j - w^k). \qquad (16)$$

We prove this lemma at the end of this section.


Taking these two lemmas as given for the moment, consider the set $\{z^1, \ldots, z^{M(\alpha)}\}$ of $d$-dimensional
binary vectors given by Lemma 7. The Laplacian $L$ of the comparison graph is symmetric and
positive-semidefinite, and so has a diagonalization of the form $L = U^T \Lambda U$ where $U \in \mathbb{R}^{d \times d}$ is an
orthonormal matrix, and $\Lambda$ is a diagonal matrix of nonnegative eigenvalues.

Letting $\Lambda^\dagger$ denote the Moore-Penrose pseudo-inverse of $\Lambda$, consider the collection $\{w^1, \ldots, w^{M(\alpha)}\}$
of vectors given by $w^j := \frac{\delta}{\sqrt{d}} U^T \sqrt{\Lambda^\dagger}\, z^j$ for each $j \in [M(\alpha)]$. Since $\mathbf{1} \in \mathrm{nullspace}(L)$, we are
guaranteed that $\langle \mathbf{1}, w^j \rangle = \frac{\delta}{\sqrt{d}} \mathbf{1}^T U^T \sqrt{\Lambda^\dagger}\, z^j = 0$. On the other hand,

$$(w^j - w^k)^T L (w^j - w^k) = \frac{\delta^2}{d} (z^j - z^k)^T \sqrt{\Lambda^\dagger}\, U L U^T \sqrt{\Lambda^\dagger}\, (z^j - z^k) = \frac{\delta^2}{d} (z^j - z^k)^T \sqrt{\Lambda^\dagger}\, \Lambda \sqrt{\Lambda^\dagger}\, (z^j - z^k) = \frac{\delta^2}{d} \|z^j - z^k\|_2^2.$$

Here the last step makes use of the fact that the first coordinate of each vector $z^j$ and $z^k$ is zero. It
follows that $\alpha\delta^2 \leq \|w^j - w^k\|_L^2 \leq \delta^2$.

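Setting $\delta^2 := 0.01 \frac{\sigma^2 d}{n \zeta}$, we find that

$$\|w^j\|_\infty \leq \frac{\delta}{\sqrt{d}} \big\|\sqrt{\Lambda^\dagger}\, z^j\big\|_2 \overset{(i)}{\leq} \frac{\delta}{\sqrt{d}} \sqrt{\mathrm{tr}(\Lambda^\dagger)} \overset{(ii)}{=} \frac{\delta}{\sqrt{d}} \sqrt{\mathrm{tr}(L^\dagger)} \overset{(iii)}{\leq} B,$$

where inequality (i) follows from the fact that $z^j$ has entries in $\{0,1\}$; equation (ii) follows since
$L^\dagger = U^T \Lambda^\dagger U$ by definition; and inequality (iii) follows from our choice of $\delta$ and our assumption
$n \geq c\, \frac{\sigma^2\, \mathrm{tr}(L^\dagger)}{\zeta B^2}$ on the sample size with $c = 0.01$. We have thus verified that each vector $w^j$ also
satisfies the boundedness constraint $\|w^j\|_\infty \leq B$ required for membership in $\mathcal{W}_B$. Finally, observe
that

$$\max_{j \neq k} D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \frac{n\zeta}{\sigma^2}\,\delta^2 = 0.01\, d \quad \text{and} \quad \min_{j \neq k} \|w^j - w^k\|_L^2 \geq \alpha\delta^2.$$

We have thus constructed a suitable packing set for applying Lemma 6, which yields the lower bound

$$\inf_{\widehat{w}} \sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[\|\widehat{w} - w^*\|_L^2\big] \geq \frac{\alpha\delta^2}{2}\Big(1 - \frac{0.01\, d + \log 2}{\log M(\alpha)}\Big).$$

Substituting our choice of $\delta$ and setting $\alpha = 0.01$ proves the claim for $d > 9$.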
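To see what the final substitution buys, one can evaluate the bracketed Fano factor numerically. The sketch below is illustrative only: the helper names are ours, and it assumes the packing size from (14) with alpha = 0.01 together with a KL budget of 0.01 d, as in the argument above.

```python
import math


def packing_size(d: int, alpha: float = 0.01) -> int:
    """M(alpha) from definition (14)."""
    exponent = 0.5 * d * (
        math.log(2)
        + 2 * alpha * math.log(2 * alpha)
        + (1 - 2 * alpha) * math.log(1 - 2 * alpha)
    )
    return math.floor(math.exp(exponent))


def fano_factor(d: int, alpha: float = 0.01, kl_per_dim: float = 0.01) -> float:
    """The factor 1 - (beta + log 2) / log M(alpha) with beta = kl_per_dim * d."""
    beta = kl_per_dim * d
    return 1.0 - (beta + math.log(2)) / math.log(packing_size(d, alpha))


if __name__ == "__main__":
    for dim in (10, 20, 100, 1000):
        print(dim, packing_size(dim), round(fano_factor(dim), 4))
```

Under these assumptions the factor stays bounded away from zero once d is at least 10, so the lower bound is a constant multiple of alpha times delta squared.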

In order to handle the case $d \leq 9$, we consider the set of three $d$-length vectors given by
$z^1 = [0 \cdots 0\ {-1}]$, $z^2 = [0 \cdots 0\ 1]$ and $z^3 = [0 \cdots 0\ 0]$. Construct the packing set $\{w^1, w^2, w^3\}$
from these three vectors $\{z^1, z^2, z^3\}$ as done above for the case of $d > 9$. From the calculations made
for the general case above, we have $\min_{j \neq k} \|w^j - w^k\|_L^2 \geq \frac{\delta^2}{9}$ and $\max_{j,k} \|w^j - w^k\|_L^2 \leq 4\delta^2$,
and as a result $\max_{j,k} D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \frac{4 n \zeta \delta^2}{\sigma^2}$. Choosing $\delta^2 = \frac{\sigma^2 \log 2}{8 n \zeta}$ and applying Lemma 6 proves
the theorem.


The only remaining detail is to prove Lemma 8. 

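Proof of Lemma 8: For any pair of quality score vectors $w^j$ and $w^k$, the KL divergence between
the distributions $\mathbb{P}_{w^j}$ and $\mathbb{P}_{w^k}$ is given by

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) = \sum_{i=1}^n \Big[ F(\langle w^j, x_i \rangle/\sigma) \log \frac{F(\langle w^j, x_i \rangle/\sigma)}{F(\langle w^k, x_i \rangle/\sigma)} + \big(1 - F(\langle w^j, x_i \rangle/\sigma)\big) \log \frac{1 - F(\langle w^j, x_i \rangle/\sigma)}{1 - F(\langle w^k, x_i \rangle/\sigma)} \Big].$$

For any $a, b \in (0,1)$, we have the elementary inequality $a \log \frac{a}{b} \leq (a - b)\frac{a}{b}$. Applying this inequality
to our expression above gives

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \sum_{i=1}^n \big(F(\langle w^j, x_i \rangle/\sigma) - F(\langle w^k, x_i \rangle/\sigma)\big) \Big[ \frac{F(\langle w^j, x_i \rangle/\sigma)}{F(\langle w^k, x_i \rangle/\sigma)} - \frac{1 - F(\langle w^j, x_i \rangle/\sigma)}{1 - F(\langle w^k, x_i \rangle/\sigma)} \Big] = \sum_{i=1}^n \frac{\big(F(\langle w^j, x_i \rangle/\sigma) - F(\langle w^k, x_i \rangle/\sigma)\big)^2}{F(\langle w^k, x_i \rangle/\sigma)\big(1 - F(\langle w^k, x_i \rangle/\sigma)\big)}.$$

Since $\max\{\|w^j\|_\infty, \|w^k\|_\infty\} \leq B$, and since $F$ is a non-decreasing function, we have

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \sum_{i=1}^n \frac{\big(F(\langle w^j, x_i \rangle/\sigma) - F(\langle w^k, x_i \rangle/\sigma)\big)^2}{F(2B/\sigma)\big(1 - F(2B/\sigma)\big)}.$$

Finally, applying the mean value theorem and recalling the definition of $\zeta$ (from (6)) yields

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \sum_{i=1}^n \zeta \big(\langle w^j, x_i \rangle/\sigma - \langle w^k, x_i \rangle/\sigma\big)^2 = \frac{n\zeta}{\sigma^2}\,(w^j - w^k)^T L\,(w^j - w^k),$$

as claimed.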
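The per-comparison inequality behind Lemma 8 can be sanity-checked numerically. The sketch below assumes the BTL model, i.e. the logistic choice F(x) = 1/(1 + exp(-x)), for which the maximum of F' on [0, 2B/sigma] is F'(0) = 1/4; the function names and the sampling scheme are ours and purely illustrative. It compares the Bernoulli KL divergence for one comparison against zeta times the squared difference of the scaled scores.

```python
import math
import random


def F(x: float) -> float:
    """Logistic CDF (BTL model); an assumption made for this illustration."""
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def zeta(B: float, sigma: float) -> float:
    """zeta = max_{x in [0, 2B/sigma]} F'(x) / (F(2B/sigma) (1 - F(2B/sigma)))."""
    top = F(2 * B / sigma)
    return 0.25 / (top * (1 - top))  # logistic F' is maximised at 0, value 1/4


def worst_ratio(B: float = 1.0, sigma: float = 1.0, trials: int = 10000, seed: int = 0) -> float:
    """Largest observed value of KL / (zeta * (s - t)^2) over random scaled scores
    s, t in [-2B/sigma, 2B/sigma]; the inequality predicts a ratio of at most 1."""
    rng = random.Random(seed)
    radius = 2 * B / sigma
    z = zeta(B, sigma)
    worst = 0.0
    for _ in range(trials):
        s, t = rng.uniform(-radius, radius), rng.uniform(-radius, radius)
        if abs(s - t) < 1e-6:
            continue  # avoid dividing by a vanishing gap
        worst = max(worst, bernoulli_kl(F(s), F(t)) / (z * (s - t) ** 2))
    return worst
```

Summing the per-comparison inequality over the n samples and using L = X^T X / n gives the right-hand side of (16).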


A.2 Upper bound 

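For the Ordinal model, the MLE is given by $\widehat{w} \in \arg\min_{w \in \mathcal{W}_B} \ell(w)$, where

$$\ell(w) = -\frac{1}{n} \sum_{i=1}^n \Big\{ 1[y_i = 1] \log F\big(\langle w, x_i \rangle/\sigma\big) + 1[y_i = -1] \log\big(1 - F\big(\langle w, x_i \rangle/\sigma\big)\big) \Big\}, \text{ and} \tag{17a}$$

$$\mathcal{W}_B := \{ w \in \mathbb{R}^d \mid \langle \mathbf{1}, w \rangle = 0, \text{ and } \|w\|_\infty \leq B \}. \tag{17b}$$

Our goal is to bound the estimation error of the MLE in the squared semi-norm $\|v\|_L^2 = v^T L v$.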

For the purposes of this proof (as well as subsequent ones), let us state and prove an auxiliary
lemma that applies more generally to $M$-estimators that are based on minimizing an arbitrary convex
and differentiable function over some subset $\mathcal{W}$ of the set $\mathcal{W}_\infty := \{ w \in \mathbb{R}^d \mid \langle \mathbf{1}, w \rangle = 0 \}$. The MLE
under consideration here is a special case. This lemma requires that $\ell$ is differentiable and strongly
convex at $w^*$ with respect to the semi-norm $\|\cdot\|_L$, meaning that there is some constant $\kappa > 0$ such
that

$$\ell(w^* + \Delta) - \ell(w^*) - \langle \nabla\ell(w^*), \Delta \rangle \geq \kappa \|\Delta\|_L^2 \tag{18}$$

for all perturbations $\Delta \in \mathbb{R}^d$ such that $(w^* + \Delta) \in \mathcal{W}$. Finally, it is also convenient to introduce the
semi-norm $\|u\|_{L^\dagger} = \sqrt{u^T L^\dagger u}$, where $L^\dagger$ is the Moore-Penrose pseudo-inverse of $L$.

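Lemma 9 (Upper bound for $M$-estimators) Consider the $M$-estimator

$$\widehat{w} \in \arg\min_{w \in \mathcal{W}} \ell(w), \quad \text{where } \mathcal{W} \text{ is any subset of } \mathcal{W}_\infty, \tag{19}$$

and $\ell$ is a differentiable cost function satisfying the $\kappa$-strong convexity condition (18) at some $w^* \in \mathcal{W}$.
Then

$$\|\widehat{w} - w^*\|_L \leq \frac{1}{\kappa} \|\nabla\ell(w^*)\|_{L^\dagger}. \tag{20}$$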



Proof Since $\widehat{w}$ and $w^*$ are optimal and feasible, respectively, for the original optimization problem,
we have $\ell(\widehat{w}) \leq \ell(w^*)$. Defining the error vector $\Delta = \widehat{w} - w^*$, adding and subtracting the quantity
$\langle \nabla\ell(w^*), \Delta \rangle$ yields the bound

$$\ell(w^* + \Delta) - \ell(w^*) - \langle \nabla\ell(w^*), \Delta \rangle \leq -\langle \nabla\ell(w^*), \Delta \rangle.$$

By the $\kappa$-convexity condition, the left-hand side is lower bounded by $\kappa\|\Delta\|_L^2$. As for the right-hand
side, note that $\Delta$ satisfies the constraint $\langle \mathbf{1}, \Delta \rangle = 0$, and thus is orthogonal to the nullspace
of the Laplacian matrix $L$. Therefore, by Lemma 16 (in Appendix F), we have $|\langle \nabla\ell(w^*), \Delta \rangle| \leq
\|\nabla\ell(w^*)\|_{L^\dagger} \|\Delta\|_L$. Combining the pieces yields the claimed inequality (20). ■


In order to apply Lemma 9 to the MLE for the Ordinal model, we need to verify that the
negative log likelihood (17a) satisfies the strong convexity condition, and we need to bound the
random variable $\|\nabla\ell(w^*)\|_{L^\dagger}$ defined in the dual norm $\|\cdot\|_{L^\dagger}$.

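Verifying strong convexity: By the chain rule, the Hessian of $\ell$ is given by

$$\nabla^2 \ell(w) = \frac{1}{n\sigma^2} \sum_{i=1}^n \big\{ 1[y_i = 1]\, T_{i1} + 1[y_i = -1]\, T_{i2} \big\}\, x_i x_i^T,$$

where

$$T_{i1} := \frac{F'(\langle w, x_i \rangle/\sigma)^2 - F(\langle w, x_i \rangle/\sigma)\, F''(\langle w, x_i \rangle/\sigma)}{F(\langle w, x_i \rangle/\sigma)^2}, \quad \text{and} \quad T_{i2} := \frac{F'(\langle w, x_i \rangle/\sigma)^2 + \big(1 - F(\langle w, x_i \rangle/\sigma)\big)\, F''(\langle w, x_i \rangle/\sigma)}{\big(1 - F(\langle w, x_i \rangle/\sigma)\big)^2}.$$

Observe that the term $T_{i1}$ is simply the second derivative of $-\log F$ evaluated at $\langle w, x_i \rangle/\sigma$, and hence the
strong log-concavity of $F$ implies $T_{i1} \geq \gamma$. On the other hand, the term $T_{i2}$ is the second derivative
of $-\log(1 - F)$. Since $F(-x) = 1 - F(x)$ for all $x$, it follows that the function $x \mapsto 1 - F(x)$ is also
strongly log-concave with parameter $\gamma$ and hence $T_{i2} \geq \gamma$. Putting together the pieces, we conclude
that

$$v^T \nabla^2 \ell(w)\, v \geq \frac{\gamma}{n\sigma^2} \|Xv\|_2^2 \quad \text{for all } v, w \in \mathcal{W}_B,$$

where $X \in \mathbb{R}^{n \times d}$ has the differencing vector $x_i \in \mathbb{R}^d$ as its $i$-th row.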
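As a concrete instance of the curvature terms, consider again the logistic F of the BTL model (our choice for illustration). There the second derivative of -log F has the closed form F(x)(1 - F(x)), so on the interval [-2B/sigma, 2B/sigma] its minimum is attained at the endpoints. The sketch below checks this with a finite-difference approximation; the function names and the step size are assumptions of ours.

```python
import math


def F(x: float) -> float:
    """Logistic CDF (BTL model); an assumption made for this illustration."""
    return 1.0 / (1.0 + math.exp(-x))


def neg_log_curvature(x: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of the second derivative of -log F at x."""
    g = lambda t: -math.log(F(t))
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)


def gamma_logistic(B: float, sigma: float) -> float:
    """Smallest curvature of -log F on [-2B/sigma, 2B/sigma] for the logistic F."""
    edge = F(2.0 * B / sigma)
    return edge * (1.0 - edge)


if __name__ == "__main__":
    B, sigma = 1.0, 1.0
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, round(neg_log_curvature(x), 6), round(F(x) * (1 - F(x)), 6))
    print("gamma:", round(gamma_logistic(B, sigma), 6))
```

Because F(-x) = 1 - F(x), the same numbers bound the curvature of -log(1 - F), which is the role played by the second term in the Hessian.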

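Thus, if we introduce the error vector $\Delta := \widehat{w} - w^*$, then we may conclude that

$$\ell(w^* + \Delta) - \ell(w^*) - \langle \nabla\ell(w^*), \Delta \rangle \geq \frac{\gamma}{n\sigma^2} \|X\Delta\|_2^2 = \frac{\gamma}{\sigma^2} \|\Delta\|_L^2,$$

showing that $\ell$ is strongly convex around $w^*$ with parameter $\kappa = \frac{\gamma}{\sigma^2}$. An application of Lemma 9
then gives $\|\Delta\|_L^2 \leq \frac{\sigma^4}{\gamma^2} \|\nabla\ell(w^*)\|_{L^\dagger}^2$.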

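Bounding the dual norm: In order to obtain a concrete bound, it remains to control the quantity
$\nabla\ell(w^*)^T L^\dagger \nabla\ell(w^*)$. Observe that the gradient takes the form

$$\nabla\ell(w^*) = -\frac{1}{n\sigma} \sum_{i=1}^n \Big[ 1[y_i = 1] \frac{F'(\langle w^*, x_i \rangle/\sigma)}{F(\langle w^*, x_i \rangle/\sigma)} - 1[y_i = -1] \frac{F'(\langle w^*, x_i \rangle/\sigma)}{1 - F(\langle w^*, x_i \rangle/\sigma)} \Big]\, x_i.$$

Define a random vector $V \in \mathbb{R}^n$ with independent components as

$$V_i = \begin{cases} \dfrac{F'(\langle w^*, x_i \rangle/\sigma)}{F(\langle w^*, x_i \rangle/\sigma)} & \text{w.p. } F(\langle w^*, x_i \rangle/\sigma), \\[2mm] -\dfrac{F'(\langle w^*, x_i \rangle/\sigma)}{1 - F(\langle w^*, x_i \rangle/\sigma)} & \text{w.p. } 1 - F(\langle w^*, x_i \rangle/\sigma), \end{cases}$$

so that $\nabla\ell(w^*) = -\frac{1}{n\sigma} X^T V$. Each component has zero mean and satisfies $|V_i| \leq \zeta$,
where $\zeta$ is as defined in (6). Defining the $n$-dimensional square matrix $M := \frac{\sigma^2}{\gamma^2 n^2} X L^\dagger X^T$, our
definitions and previous bounds imply that $\|\Delta\|_L^2 \leq V^T M V$.

Consequently, our problem has been reduced to controlling the fluctuations of the quadratic form
$V^T M V$; in order to do so, we apply the Hanson-Wright inequality (see Lemma 13 in Appendix E).
A straightforward calculation yields

$$|||M|||_{\mathrm{fro}}^2 = (d-1) \frac{\sigma^4}{\gamma^4 n^2} \quad \text{and} \quad |||M|||_{\mathrm{op}} = \frac{\sigma^2}{\gamma^2 n},$$

where we have used the fact that $L = \frac{1}{n} X^T X$. Moreover, since the components of $V$ are independent
and of zero mean, a straightforward calculation yields that $\mathbb{E}[V^T M V] \leq \zeta^2\, \mathrm{tr}(M) \leq \frac{\zeta^2 \sigma^2 d}{\gamma^2 n}$.
Since $|V_i| \leq \zeta$, the variables are $\zeta$-sub-Gaussian, and hence the Hanson-Wright inequality implies
that

$$\mathbb{P}\big[ V^T M V - \mathbb{E}[V^T M V] > t \big] \leq 2 \exp\Big( -c \min\Big\{ \frac{t^2}{\zeta^4 |||M|||_{\mathrm{fro}}^2},\ \frac{t}{\zeta^2 |||M|||_{\mathrm{op}}} \Big\} \Big) \quad \text{for all } t > 0.$$

Consequently, after some simple algebra, we conclude that

$$\mathbb{P}\Big( \|\Delta\|_L^2 > c\, t\, \frac{\zeta^2 \sigma^2}{\gamma^2} \frac{d}{n} \Big) \leq e^{-t} \quad \text{for all } t \geq 1,$$

for some universal constant $c$. Integrating this tail bound yields the bound on the expectation.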


Appendix B. Proof of Theorem 2 

The following sections prove the upper and lower bounds (respectively) on the minimax risk in the
squared Euclidean norm for the Ordinal model. We prove the lower bound in two parts corresponding
to the two components of the "max" in the statement of the theorem.

B.1 Upper bound

The proof of the upper bound under the Euclidean norm follows directly from the upper bound
under the $L$ semi-norm proved in Theorem 1. From the setting described in Section 2, we have that
the nullspace of the matrix $L$ is given by the span of the all-ones vector. Furthermore, we have
$\langle w^* - \widehat{w}, \mathbf{1} \rangle = 0$, and hence $\|w^* - \widehat{w}\|_L^2 \geq \lambda_2(L)\, \|w^* - \widehat{w}\|_2^2$. Substituting this inequality into the upper
bound (7b) gives the desired result.
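The inequality between the two norms is easy to check on a toy comparison graph. The sketch below uses a hypothetical three-item path graph with the two comparisons (0,1) and (1,2), each observed once; its Laplacian L = (1/n) sum_i x_i x_i^T has eigenvalues 0, 0.5 and 1.5. The graph, the hard-coded eigenvalue and the function names are our own illustrative choices.

```python
import random

# Laplacian of the path 0 - 1 - 2 with n = 2 comparisons, L = (1/n) * sum x_i x_i^T.
L = [
    [0.5, -0.5, 0.0],
    [-0.5, 1.0, -0.5],
    [0.0, -0.5, 0.5],
]
LAMBDA_2 = 0.5  # second-smallest eigenvalue of L (eigenvalues are 0, 0.5, 1.5)


def quad_form(matrix, v):
    """v^T matrix v for a small dense matrix stored as nested lists."""
    size = len(v)
    return sum(v[a] * matrix[a][b] * v[b] for a in range(size) for b in range(size))


def centered(v):
    """Project v onto the subspace orthogonal to the all-ones vector."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def min_ratio(trials: int = 2000, seed: int = 0) -> float:
    """Smallest observed value of ||v||_L^2 / ||v||_2^2 over random centered v."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(trials):
        v = centered([rng.gauss(0.0, 1.0) for _ in range(3)])
        best = min(best, quad_form(L, v) / sum(x * x for x in v))
    return best
```

Every sampled ratio stays at or above the second-smallest eigenvalue, which is exactly the step that converts a bound in the L semi-norm into a Euclidean bound at the cost of a factor 1/lambda_2(L).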

B.2 Lower bound: Part I 

The Laplacian $L$ of the comparison graph is symmetric and positive-semidefinite. By diagonalization,
we can write $L = U^T \Lambda U$ where $U \in \mathbb{R}^{d \times d}$ is an orthonormal matrix, and $\Lambda$ is a diagonal
matrix of nonnegative eigenvalues with $\Lambda_{jj} = \lambda_j(L)$.

We first use the Fano method (Lemma 6) to prove that the minimax risk is lower bounded
as $c\, \frac{\sigma^2 d^2}{\zeta n}$ for a universal constant $c > 0$. For scalars $\alpha \in (0, \tfrac{1}{4})$ and $\delta > 0$ whose values will be specified later, recall the set
$\{z^1, \ldots, z^{M(\alpha)}\}$ of vectors in the Boolean hypercube $\{0,1\}^d$ given by Lemma 7. We then define a
second set $\{w^j, j \in [M(\alpha)]\}$ via $w^j := \frac{\delta}{\sqrt{d}} U^T P z^j$, where $P$ is a permutation matrix to be specified
momentarily. At this point, the only constraint imposed on $P$ is that it keeps the first coordinate
constant. By construction, for each $j \neq k$, we have $\|w^j - w^k\|_2^2 = \frac{\delta^2}{d} \|z^j - z^k\|_2^2 \geq \alpha\delta^2$, where the
final inequality follows from the fact that the set $\{z^1, \ldots, z^{M(\alpha)}\}$ comprises binary vectors with a
minimum Hamming distance at least $\alpha d$.

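Consider any distinct $j, k \in [M(\alpha)]$. Then, for some $\{i_1, \ldots, i_r\} \subseteq \{2, \ldots, d\}$ with $\alpha d \leq r \leq d$, it
must be that

$$\|w^j - w^k\|_L^2 = \frac{\delta^2}{d} \|U^T P z^j - U^T P z^k\|_L^2 = \frac{\delta^2}{d} \|P z^j - P z^k\|_\Lambda^2 = \frac{\delta^2}{d} \sum_{m=1}^r \lambda_{i_m}(L).$$

It follows that for some non-negative numbers $a_2, \ldots, a_d$ such that $\alpha d \leq \sum_{i=2}^d a_i \leq d$,

$$\frac{1}{\binom{M(\alpha)}{2}} \sum_{j \neq k} \|w^j - w^k\|_L^2 = \frac{\delta^2}{d} \sum_{i=2}^d a_i \lambda_i(L).$$

We choose the permutation matrix $P$ such that the last $(d-1)$ coordinates are permuted to have
$a_2 \geq \cdots \geq a_d$ and the first coordinate remains fixed. With this choice, we get

$$\frac{1}{\binom{M(\alpha)}{2}} \sum_{j \neq k} \|w^j - w^k\|_L^2 \leq \frac{\delta^2}{d} \cdot \frac{d}{d-1}\, \mathrm{tr}(L) \leq \frac{2\delta^2}{d}\, \mathrm{tr}(L).$$

Lemma 14 (Appendix F) gives the trace constraint $\mathrm{tr}(L) = 2$, which in turn guarantees that
$\frac{1}{\binom{M(\alpha)}{2}} \sum_{j \neq k} \|w^j - w^k\|_L^2 \leq \frac{4\delta^2}{d}$. For the choice of $P$ specified above, we have for every $j \in [M(\alpha)]$,

$$\langle \mathbf{1}, w^j \rangle = \frac{\delta}{\sqrt{d}} \mathbf{1}^T U^T P z^j = \delta\, e_1^T P z^j = \delta\, e_1^T z^j = 0,$$

where the final equation employed the property (15b).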

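Setting $\delta^2 = 0.01 \frac{\sigma^2 d^2}{4 n \zeta}$, we have $\|w^j\|_\infty \leq \frac{\delta}{\sqrt{d}} \|z^j\|_2 \overset{(i)}{\leq} \delta \overset{(ii)}{\leq} B$, where inequality (i) follows from
the fact that $z^j$ has entries in $\{0,1\}$; inequality (ii) follows from our choice of $\delta$ and our assumption
$n \geq c\, \frac{\sigma^2\, \mathrm{tr}(L^\dagger)}{\zeta B^2}$ on the sample size with $c = 0.002$, together with the lower bound on $\mathrm{tr}(L^\dagger)$ guaranteed by Lemma 14. We have
thus verified that each vector $w^j$ also satisfies the boundedness constraint $\|w^j\|_\infty \leq B$ required for
membership in $\mathcal{W}_B$.

From the proof of Theorem 1, we have that for any distinct $j, k$, $D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \frac{n\zeta}{\sigma^2} \|w^j - w^k\|_L^2$, and
hence

$$\frac{1}{\binom{M(\alpha)}{2}} \sum_{j \neq k} D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \frac{n\zeta}{\sigma^2} \cdot \frac{4\delta^2}{d} = 0.01\, d,$$

where we have substituted our previous choice of $\delta$.

Applying Lemma 6 with the packing set $\{w^1, \ldots, w^{M(\alpha)}\}$ gives

$$\inf_{\widehat{w}} \sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[\|\widehat{w} - w^*\|_2^2\big] \geq \frac{\alpha\delta^2}{2}\Big(1 - \frac{0.01\, d + \log 2}{\log M(\alpha)}\Big).$$

Substituting our choice of $\delta$ and setting $\alpha = 0.01$ proves the claim for $d > 9$.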

For the case of $d \leq 9$, consider the set of the three $d$-length vectors $z^1 = [0 \cdots 0\ {-1}]$,
$z^2 = [0 \cdots 0\ 1]$ and $z^3 = [0 \cdots 0\ 0]$. Construct the packing set $\{w^1, w^2, w^3\}$ from these three
vectors $\{z^1, z^2, z^3\}$ as done above for the case of $d > 9$. From the calculations made for the general
case above, we have for all pairs $\min_{j \neq k} \|w^j - w^k\|_2^2 \geq \frac{\delta^2}{9}$ and $\max_{j,k} \|w^j - w^k\|_L^2 \leq 4\delta^2$, and as a
result $\max_{j,k} D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \frac{4 n \zeta \delta^2}{\sigma^2}$. Choosing $\delta^2 = \frac{\sigma^2 \log 2}{8 n \zeta}$ and applying Lemma 6 yields the claim.

B.3 Lower bound: Part II


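Given an integer $d' \in \{2, \ldots, d\}$, and scalars $\alpha \in (0, \tfrac{1}{4})$ and $\delta > 0$, define the integer

$$M'(\alpha) := \Big\lfloor \exp\Big\{ \frac{d'}{2}\big(\log 2 + 2\alpha \log 2\alpha + (1 - 2\alpha)\log(1 - 2\alpha)\big) \Big\} \Big\rfloor. \tag{22}$$

Applying Lemma 7 with $d'$ as the dimension yields a subset $\{z^1, \ldots, z^{M'(\alpha)}\}$ of the Boolean hypercube
$\{0,1\}^{d'}$ with the stated properties. We then define a set of $d$-length vectors $\{\tilde{w}^1, \ldots, \tilde{w}^{M'(\alpha)}\}$ via

$$\tilde{w}^j = [0\ \ (z^j)^T\ \ 0 \cdots 0]^T \quad \text{for each } j \in [M'(\alpha)].$$

For each $j \in [M'(\alpha)]$, let us define $w^j := \frac{\delta}{\sqrt{d'}} U^T \sqrt{\Lambda^\dagger}\, \tilde{w}^j$. Now, letting $e_1 \in \mathbb{R}^d$ denote the first
standard basis vector, we have $\langle \mathbf{1}, w^j \rangle = \frac{\delta}{\sqrt{d'}} \mathbf{1}^T U^T \sqrt{\Lambda^\dagger}\, \tilde{w}^j = \frac{\delta \sqrt{d}}{\sqrt{d'}}\, e_1^T \sqrt{\Lambda^\dagger}\, \tilde{w}^j = 0$, where we have used the fact that
$\mathbf{1} \in \mathrm{nullspace}(L)$. Furthermore, for any $j \neq k$, we have

$$\|w^j - w^k\|_2^2 = \frac{\delta^2}{d'} (\tilde{w}^j - \tilde{w}^k)^T \Lambda^\dagger (\tilde{w}^j - \tilde{w}^k) \geq \frac{\delta^2}{d'} \sum_{i = \lceil (1-\alpha) d' \rceil}^{d'} \frac{1}{\lambda_i(L)}.$$

Thus, setting $\delta^2 = 0.01 \frac{\sigma^2 d'}{n \zeta}$ yields

$$\|w^j\|_\infty \leq \frac{\delta}{\sqrt{d'}} \big\|\sqrt{\Lambda^\dagger}\, \tilde{w}^j\big\|_2 \overset{(i)}{\leq} \frac{\delta}{\sqrt{d'}} \sqrt{\mathrm{tr}(\Lambda^\dagger)} \overset{(ii)}{=} \frac{\delta}{\sqrt{d'}} \sqrt{\mathrm{tr}(L^\dagger)} \overset{(iii)}{\leq} B,$$

where inequality (i) follows from the fact that $z^j$ has entries in $\{0,1\}$; step (ii) follows because the
matrices $\Lambda^\dagger$ and $L^\dagger$ have the same eigenvalues; and inequality (iii) follows from our choice of $\delta$
and our assumption $n \geq c\, \frac{\sigma^2\, \mathrm{tr}(L^\dagger)}{\zeta B^2}$ on the sample size with $c = 0.01$. We have thus verified that each
vector $w^j$ also satisfies the boundedness constraint $\|w^j\|_\infty \leq B$ required for membership in $\mathcal{W}_B$.
Furthermore, for any pair of distinct vectors in this set, we have

$$\|w^j - w^k\|_L^2 = \frac{\delta^2}{d'} \|z^j - z^k\|_2^2 \leq \delta^2.$$

From the proof of Theorem 1, we have $D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \frac{n\zeta}{\sigma^2} \|w^j - w^k\|_L^2 \leq 0.01\, d'$. Applying Lemma 6 with
the packing set $\{w^1, \ldots, w^{M'(\alpha)}\}$ gives

$$\inf_{\widehat{w}} \sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[\|\widehat{w} - w^*\|_2^2\big] \geq \frac{\delta^2}{2 d'} \sum_{i = \lceil (1-\alpha) d' \rceil}^{d'} \frac{1}{\lambda_i(L)} \Big(1 - \frac{0.01\, d' + \log 2}{\log M'(\alpha)}\Big).$$

Substituting our choice of $\delta$ and setting $\alpha = 0.01$ proves the claim for $d' > 9$.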

For the case of $d' \leq 9$, we will show a lower bound of $c\, \frac{\sigma^2}{n \zeta \lambda_2(L)}$ for a universal constant $c > 0$. This
quantity is at least as large as the claimed lower bound. Consider the packing set of three $d$-length
vectors $w^1 = \delta\, U^T \sqrt{\Lambda^\dagger}\, [0\ 1\ 0 \cdots 0]^T$, $w^2 = -w^1$ and $w^3 = [0 \cdots 0]^T$ for some $\delta > 0$. Then for
every $j \neq k$, one can verify that $\|w^j - w^k\|_L^2 \leq 4\delta^2$ and $\|w^j - w^k\|_2^2 \geq \frac{\delta^2}{\lambda_2(L)}$. Choosing $\delta^2 = \frac{\sigma^2 \log 2}{8 n \zeta}$ and
applying Lemma 6 proves the claim for $d' \leq 9$.

Finally, taking the maximum over all values of $d' \in \{2, \ldots, d\}$ gives the claimed lower bound.


Appendix C. Proof of Theorem 3 

We now turn to the proof of Theorem 3 on the minimax rate for the Paired Cardinal model. Recall
that this observation model takes the standard linear form $y = Xw^* + \epsilon$, where $y \in \mathbb{R}^n$, $w^* \in \mathcal{W}_\infty$
and $\epsilon \sim N(0, \sigma^2 I)$.

C.1 Upper bound under the squared L semi-norm


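The maximum likelihood estimate in the Paired Cardinal model is a special case of the general
$M$-estimator (19) with $\ell(w) := \frac{1}{2n} \|y - Xw\|_2^2$. For this quadratic objective function, it is
easy to verify that the $\kappa$-convexity condition holds with $\kappa = 1$. (In particular, note that the Hessian
of $\ell$ is given by $L = X^T X / n$.)

Given the result of Lemma 9, it remains to upper bound $\|\nabla\ell(w^*)\|_{L^\dagger}^2$. A straightforward
computation yields $\|\nabla\ell(w^*)\|_{L^\dagger}^2 = g^T Q g$, where $g := \epsilon/\sigma$ and $Q := \frac{\sigma^2}{n^2} X L^\dagger X^T$. Consequently, the random variable
$\|\nabla\ell(w^*)\|_{L^\dagger}^2$ is a quadratic form in the standard Gaussian random vector $g$. An application of Lemma 15
(Appendix F) gives $\mathrm{tr}(Q) = \frac{\sigma^2}{n}(d - 1)$ and $|||Q|||_{\mathrm{op}} = \frac{\sigma^2}{n}$, and then applying a known tail bound on
Gaussian quadratic forms (see Lemma 12 in Appendix E) yields

$$\mathbb{P}\Big[ \|\nabla\ell(w^*)\|_{L^\dagger}^2 \geq \frac{\sigma^2}{n} \big(\sqrt{d-1} + \sqrt{\delta}\big)^2 \Big] \leq e^{-\delta/2} \quad \text{for all } \delta \geq 0.$$

Since $d \geq 2$, we have $\frac{\sigma^2}{n} \big(\sqrt{d-1} + \sqrt{\delta}\big)^2 \leq \frac{2\sigma^2 d\, \delta}{n}$ for all $\delta \geq 4$, which yields

$$\mathbb{P}\Big[ \|\nabla\ell(w^*)\|_{L^\dagger}^2 > t\, \frac{4\sigma^2 d}{n} \Big] \leq e^{-t} \quad \text{for all } t \geq 8.$$

Integrating this tail bound yields that $\mathbb{E}\big[\|\nabla\ell(w^*)\|_{L^\dagger}^2\big] \leq c\, \frac{\sigma^2 d}{n}$, from which the claim follows.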


C.2 Lower bound under the squared L semi-norm 

Based on the pairwise Fano lower bound previously stated in Lemma 6, we need to construct a
suitable $(\delta, \beta)$-packing, where the semi-norm $\rho(w^j, w^k) = \|w^j - w^k\|_L$ is defined by the Laplacian.
Given the additive Gaussian noise observation model, we also have

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) = \frac{n}{2\sigma^2} \|w^j - w^k\|_L^2. \tag{23}$$

The construction of the packing and the remainder of the proof proceed in a manner identical to
the proof of the lower bound in Theorem 1, except for the absence of the requirement $\|w^j\|_\infty \leq B$
on the elements $\{w^j\}$ of the packing set.


C.3 Upper bound under the squared Euclidean norm 

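The upper bound follows by direct analysis of the (unconstrained) least-squares estimate, which has
the explicit form $\widehat{w} = \frac{1}{n} L^\dagger X^T y$, and thus

$$\mathbb{E}\|\widehat{w} - w^*\|_2^2 = \mathbb{E}\Big\|\frac{1}{n} L^\dagger X^T \epsilon\Big\|_2^2 = \sigma^2\, \mathrm{tr}\Big(\frac{1}{n^2} L^\dagger X^T X L^\dagger\Big),$$

where we have used the fact that $\epsilon \sim N(0, \sigma^2 I_n)$. Since $L = X^T X / n$ by definition, we conclude that
$\mathbb{E}\|\widehat{w} - w^*\|_2^2 = \frac{\sigma^2}{n}\, \mathrm{tr}(L^\dagger)$, as claimed.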
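The identity for the least-squares risk can be checked by simulation on the smallest possible instance. The sketch below assumes d = 2 items compared n times, so every row of X is (1, -1), L = [[1, -1], [-1, 1]], L^dagger = L/4 and tr(L^dagger) = 1/2; the estimator then reduces to half the mean observation with opposite signs. The function name, the true scores and the sample sizes are illustrative assumptions of ours.

```python
import random


def simulate_risk(n: int = 20, sigma: float = 1.0, w_star=(0.3, -0.3),
                  trials: int = 10000, seed: int = 0) -> float:
    """Monte Carlo estimate of E||w_hat - w*||_2^2 for d = 2 items, n comparisons."""
    rng = random.Random(seed)
    gap = w_star[0] - w_star[1]  # <x_i, w*> for the differencing vector x_i = (1, -1)
    total = 0.0
    for _ in range(trials):
        y_bar = sum(gap + rng.gauss(0.0, sigma) for _ in range(n)) / n
        # (1/n) L^dagger X^T y simplifies to (y_bar / 2, -y_bar / 2) in this instance.
        w_hat = (y_bar / 2.0, -y_bar / 2.0)
        total += (w_hat[0] - w_star[0]) ** 2 + (w_hat[1] - w_star[1]) ** 2
    return total / trials


def predicted_risk(n: int = 20, sigma: float = 1.0) -> float:
    """sigma^2 * tr(L^dagger) / n with tr(L^dagger) = 1/2 for the two-item graph."""
    return sigma ** 2 * 0.5 / n
```

With these defaults the predicted risk is 0.025, and the simulated value should agree with it up to Monte Carlo error.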

C.4 Lower bound under the squared Euclidean norm 

We obtain the lower bound by computing the Bayes risk with respect to a suitably defined (proper)
prior distribution over the weight vector $w^*$. In particular, if we impose the prior $w^* \sim N(0, \frac{\sigma^2}{n} L^\dagger)$,
Bayes' rule then leads to the posterior distribution

$$\mathbb{P}(w \mid y; X) \propto \exp\Big( -\frac{1}{2\sigma^2} \|y - Xw\|_2^2 \Big) \exp\Big( -\frac{n}{2\sigma^2} w^T L w \Big)\, 1\{\langle w, \mathbf{1} \rangle = 0\}.$$

Thus, conditioned on $y$, $w$ is distributed as $N\big( (X^T X + nL)^\dagger X^T y,\ \frac{\sigma^2}{2n} L^\dagger \big)$. By applying iterated
expectations, the Bayes risk is given by $\mathbb{E}\|w - \frac{1}{2n} L^\dagger X^T y\|_2^2 = \frac{\sigma^2}{2n}\, \mathrm{tr}(L^\dagger)$, which completes the proof.


Appendix D. Proof of Theorem 4 

This section presents the proof of Theorem 4 for the setting of m-wise comparisons. We first state 
some simple properties of the model introduced in Section 3.3, which we will use subsequently in the 
proofs of the results. 

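Lemma 10 The Laplacian of the underlying pairwise-comparison graph satisfies
$\mathrm{nullspace}(L) = \mathrm{span}(\mathbf{1})$, $\lambda_2(L) > 0$ and $\mathrm{tr}(L) = m(m-1)$.

Lemma 11 For any $j \in [m]$, $i \in [n]$ and any vector $v \in \mathbb{R}^m$, we have

$$\frac{\lambda_2(H)}{m}\, v^T (mI - \mathbf{1}\mathbf{1}^T)\, v \leq v^T R_j H R_j^T v \leq \frac{\lambda_{\max}(H)}{m}\, v^T (mI - \mathbf{1}\mathbf{1}^T)\, v.$$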

See Section D.2 for the proof of these auxiliary lemmas. 


D.l Upper bound under the squared L semi-norm 

We prove this upper bound by applying Lemma 9. In this case, the rescaled negative log likelihood
takes the form

$$\ell(w) = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^m 1[y_i = j] \log F(w^T E_i R_j),$$

and the MLE is obtained by constrained minimization over the set $\mathcal{W}_B := \{ w \in \mathbb{R}^d \mid \langle \mathbf{1}, w \rangle =
0, \text{ and } \|w\|_\infty \leq B \}$. As in our proof of the upper bound in Theorem 1, we need to verify the
$\kappa$-strong convexity condition, and to control the dual norm $\|\nabla\ell(w^*)\|_{L^\dagger}$.

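Verifying strong convexity: The gradient of the negative log likelihood is

$$\nabla\ell(w) = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^m 1[y_i = j]\, E_i R_j \nabla \log F(v)\big|_{v = w^T E_i R_j}.$$

The Hessian of the negative log likelihood can be written as

$$\nabla^2 \ell(w) = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^m 1[y_i = j]\, E_i R_j \nabla^2 \log F(v)\big|_{v = w^T E_i R_j} R_j^T E_i^T.$$

Using our strongly log-concave assumption on $F$, we have that for any vector $z \in \mathbb{R}^d$,

$$z^T \nabla^2 \ell(w)\, z = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^m 1[y_i = j]\, z^T E_i R_j \nabla^2 \log F(v)\big|_{v = w^T E_i R_j} R_j^T E_i^T z$$

$$\geq \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^m 1[y_i = j]\, z^T E_i R_j H R_j^T E_i^T z$$

$$\geq \frac{\lambda_2(H)}{nm} \sum_{i=1}^n \sum_{j=1}^m 1[y_i = j]\, z^T E_i (mI - \mathbf{1}\mathbf{1}^T) E_i^T z,$$

where the last step follows from Lemma 11. The definition (10) of $L$ implies that

$$z^T \nabla^2 \ell(w)\, z \geq \frac{\lambda_2(H)}{m}\, z^T L z = \frac{\lambda_2(H)}{m} \|z\|_L^2.$$

Consequently, the $\kappa$-convexity condition holds around $w^*$ with $\kappa = \frac{\lambda_2(H)}{m}$. An application of Lemma 9
then yields

$$\|\widehat{w}_{\mathrm{ML}} - w^*\|_L^2 \leq \frac{m^2}{\lambda_2(H)^2}\, \nabla\ell(w^*)^T L^\dagger \nabla\ell(w^*). \tag{24}$$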

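Controlling the dual norm: The gradient of the negative log likelihood can then be rewritten
as $\nabla\ell(w^*) = -\frac{1}{n} \sum_{i=1}^n E_i V_i$, where for each index $i \in [n]$, the random vector $V_i \in \mathbb{R}^m$ is given
by $V_i := \sum_{j=1}^m 1[y_i = j]\, R_j \nabla \log F(v)\big|_{v = (w^*)^T E_i R_j}$. Now observe that the matrix $M := I - \frac{1}{m} \mathbf{1}\mathbf{1}^T$ is
symmetric and positive semi-definite with rank $(m-1)$, eigenvalues $\{1, \ldots, 1, 0\}$, its nullspace equals
the span of the all-ones vector, and that $M^\dagger = M$. Using this matrix, we define the transformed
vector $\tilde{V}_i := (M^\dagger)^{1/2} V_i$ for each $i \in [n]$.

Consider a vector $x$ and its shifted version $x + t\mathbf{1}$, where $t \in \mathbb{R}$ and $\mathbf{1}$ denotes the vector of all
ones. By the shift invariance property, the function $g(t) = F(x + t\mathbf{1}) - F(x)$ is constant, and hence

$$g'(0) = \langle \nabla F(x), \mathbf{1} \rangle = 0, \quad \text{and} \quad g''(0) = \langle \mathbf{1}, (\nabla^2 F(x))\, \mathbf{1} \rangle = 0, \tag{25}$$

which implies that $\mathbf{1} \in \mathrm{nullspace}(\nabla^2 F(x))$. Furthermore, we have $\langle \nabla \log F(x), \mathbf{1} \rangle = \frac{1}{F(x)} \langle \nabla F(x), \mathbf{1} \rangle =
0$. Consequently, $\langle V_i, \mathbf{1} \rangle = 0$, so that $V_i \perp \mathrm{nullspace}(M)$. This allows us to write

$$\nabla\ell(w^*) = -\frac{1}{n} \sum_{i=1}^n E_i M^{1/2} \tilde{V}_i \quad \text{and} \quad \nabla\ell(w^*)^T L^\dagger \nabla\ell(w^*) = \frac{1}{n^2} \sum_{i=1}^n \sum_{\ell=1}^n \tilde{V}_i^T M^{1/2} E_i^T L^\dagger E_\ell M^{1/2} \tilde{V}_\ell.$$

By definition, for every pair $i \neq \ell \in [n]$, $\tilde{V}_i$ is independent of $\tilde{V}_\ell$. Moreover, for every $i \in [n]$,

$$\mathbb{E}[\tilde{V}_i] = \mathbb{E}\Big[ (M^\dagger)^{1/2} \sum_{j=1}^m 1[y_i = j]\, R_j \nabla \log F(v)\big|_{v = (w^*)^T E_i R_j} \Big]$$

$$= (M^\dagger)^{1/2} \sum_{j=1}^m F\big((w^*)^T E_i R_j\big)\, R_j \nabla \log F(v)\big|_{v = (w^*)^T E_i R_j}$$

$$= (M^\dagger)^{1/2} \sum_{j=1}^m R_j \nabla F(v)\big|_{v = (w^*)^T E_i R_j}.$$

In order to further evaluate this expression, define a function $g : \mathbb{R}^m \to \mathbb{R}$ as $g(z) = \sum_{j=1}^m F(z^T R_j)$.
Then by definition we have $g(z) = 1$. Taking derivatives, we get $0 = \nabla g(z) = \sum_{j=1}^m R_j \nabla F(z^T R_j)$.
It follows that $\mathbb{E}[\tilde{V}_i] = 0$, and hence that

$$\mathbb{E}\big[\nabla\ell(w^*)^T L^\dagger \nabla\ell(w^*)\big] = \frac{1}{n^2} \sum_{i=1}^n \sum_{\ell=1}^n \mathbb{E}\big[ \tilde{V}_i^T M^{1/2} E_i^T L^\dagger E_\ell M^{1/2} \tilde{V}_\ell \big]$$

$$= \frac{1}{n^2} \sum_{i=1}^n \mathbb{E}\big[ \tilde{V}_i^T M^{1/2} E_i^T L^\dagger E_i M^{1/2} \tilde{V}_i \big]$$

$$\leq \frac{1}{n}\, \mathbb{E}\Big[ \sup_{\ell \in [n]} \|\tilde{V}_\ell\|_2^2 \Big]\, \mathrm{tr}\Big( \frac{1}{n} \sum_{i=1}^n M^{1/2} E_i^T L^\dagger E_i M^{1/2} \Big).$$

Since $L = \frac{m}{n} \sum_{i=1}^n E_i M E_i^T$, we have $\mathrm{tr}\big( \frac{1}{n} \sum_{i=1}^n M^{1/2} E_i^T L^\dagger E_i M^{1/2} \big) = \frac{d-1}{m}$, as well as

$$\|\tilde{V}_i\|_2^2 = \sum_{j=1}^m 1[y_i = j]\, \big(\nabla \log F(v)\big|_{v = (w^*)^T E_i R_j}\big)^T R_j^T M^\dagger R_j\, \nabla \log F(v)\big|_{v = (w^*)^T E_i R_j}.$$

Recalling the previously defined matrix $M$, observe that since $R_j$ is simply a permutation matrix, we
have $R_j M R_j^T = M$ for every $j \in [m]$. By the chain rule, we have $\langle \nabla \log F(v), \mathbf{1} \rangle = \frac{1}{F(v)} \langle \nabla F(v), \mathbf{1} \rangle = 0$,
where the last step follows from our previous calculation. It follows that

$$\mathbb{E}\big[ \langle \nabla\ell(w^*), L^\dagger \nabla\ell(w^*) \rangle \big] \leq \frac{d}{nm} \sup_{v \in [-B, B]^m} \|\nabla \log F(v)\|_2^2.$$

Substituting this bound into equation (24) yields the claim.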


D.1.1 Lower bound under the squared L semi-norm 

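For any pair of quality score vectors $w^j$ and $w^k$, the KL divergence between the distributions $\mathbb{P}_{w^j}$
and $\mathbb{P}_{w^k}$ is given by

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) = \sum_{i=1}^n \sum_{l=1}^m F(w^{jT} E_i R_l) \log \frac{F(w^{jT} E_i R_l)}{F(w^{kT} E_i R_l)}.$$

Applying the inequality $\log x \leq x - 1$, valid for $x > 0$, we find that

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \sum_{i=1}^n \sum_{l=1}^m F(w^{jT} E_i R_l) \Big( \frac{F(w^{jT} E_i R_l)}{F(w^{kT} E_i R_l)} - 1 \Big).$$

Now employing the fact that $\sum_{l=1}^m F(w^{jT} E_i R_l) = \sum_{l=1}^m F(w^{kT} E_i R_l) = 1$ gives

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \sum_{i=1}^n \sum_{l=1}^m \frac{F(w^{jT} E_i R_l)^2 - 2 F(w^{jT} E_i R_l) F(w^{kT} E_i R_l) + F(w^{kT} E_i R_l)^2}{F(w^{kT} E_i R_l)}$$

$$= \sum_{i=1}^n \sum_{l=1}^m \frac{\big(F(w^{jT} E_i R_l) - F(w^{kT} E_i R_l)\big)^2}{F(w^{kT} E_i R_l)}$$

$$\leq \frac{1}{F(-B, B, \ldots, B)} \sum_{i=1}^n \sum_{l=1}^m \big(F(w^{jT} E_i R_l) - F(w^{kT} E_i R_l)\big)^2$$

$$\leq \frac{1}{F(-B, B, \ldots, B)} \sum_{i=1}^n \sum_{l=1}^m \big\langle \nabla F(z_{il}),\ w^{jT} E_i R_l - w^{kT} E_i R_l \big\rangle^2,$$

for some $z_{il} \in [-B, B]^m$. Letting $\zeta = \frac{\sup_{z \in [-B,B]^m} \|\nabla F(z)\|_{H^\dagger}^2}{F(-B, B, \ldots, B)}$ and applying Lemma 16 (noting that
$\langle w^{jT} E_i R_l, \mathrm{nullspace}(H) \rangle = 0$ for all $i, j, l$) gives

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq \sum_{i=1}^n \sum_{l=1}^m \zeta\, \|w^{jT} E_i R_l - w^{kT} E_i R_l\|_H^2$$

$$\leq \zeta\, (w^j - w^k)^T \Big( \sum_{i=1}^n \sum_{l=1}^m E_i R_l H R_l^T E_i^T \Big) (w^j - w^k)$$

$$\leq \zeta\, \lambda_m(H)\, n\, \|w^j - w^k\|_L^2, \tag{26}$$

where the final step is a result of Lemma 11.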

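Consider the pair of scalars $\alpha \in (0, \tfrac{1}{4})$ and $\delta > 0$ whose values will be specified later. Let $M(\alpha)$
be as defined in (14). Consider the packing set $\{w^1, \ldots, w^{M(\alpha)}\}$ constructed in Appendix A.1. Each
of these vectors is of length $d$, satisfies $\langle w^j, \mathbf{1} \rangle = 0$, and furthermore, each pair from this set satisfies
$\alpha\delta^2 \leq \|w^j - w^k\|_L^2 \leq \delta^2$. Setting $\delta^2 = 0.01 \frac{d}{n \zeta \lambda_m(H)}$ yields

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq 0.01\, d.$$

Every element from the packing set also satisfies $\|w^j\|_\infty \leq B$ when $n \geq 0.01 \frac{\mathrm{tr}(L^\dagger)}{\zeta \lambda_m(H) B^2}$, and thus belongs
to the class $\mathcal{W}_B$.

Applying Lemma 6 yields the lower bound

$$\inf_{\widehat{w}} \sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[\|\widehat{w} - w^*\|_L^2\big] \geq \frac{\alpha\delta^2}{2}\Big(1 - \frac{0.01\, d + \log 2}{\log M(\alpha)}\Big) = \frac{0.005\, \alpha\, d}{n \zeta \lambda_m(H)}\Big(1 - \frac{0.01\, d + \log 2}{\log M(\alpha)}\Big).$$

Setting $\alpha = 0.01$ proves the claim for $d > 9$.

For the case of $d \leq 9$, consider the set of the three $d$-length vectors $z^1 = [0 \cdots 0\ {-1}]$,
$z^2 = [0 \cdots 0\ 1]$ and $z^3 = [0 \cdots 0\ 0]$. Construct the packing set $\{w^1, w^2, w^3\}$ from these three
vectors $\{z^1, z^2, z^3\}$ as done above for the case of $d > 9$. From the calculations made for the general
case above, we have for all pairs $\min_{j \neq k} \|w^j - w^k\|_L^2 \geq \frac{\delta^2}{9}$ and $\max_{j,k} \|w^j - w^k\|_L^2 \leq 4\delta^2$, and as a
result $\max_{j,k} D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq 4 n \zeta \lambda_m(H) \delta^2$. Choosing $\delta^2 = \frac{\log 2}{8 n \zeta \lambda_m(H)}$ and applying Lemma 6 proves
the claim.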


D.1.2 Upper bound under the squared Euclidean norm 

The upper bound under the squared $\ell_2$-norm follows directly from the upper bound under the squared
$L$ semi-norm in Theorem 4: noting that $(w^* - \widehat{w}) \perp \mathrm{nullspace}(L)$, we get that

$$(w^* - \widehat{w})^T L\, (w^* - \widehat{w}) \geq \lambda_2(L)\, \|w^* - \widehat{w}\|_2^2.$$

Substituting this inequality in the upper bound on the minimax risk under the squared L semi-norm 
in Theorem 4 gives the desired result. 


D.1.3 Lower bound under the squared Euclidean norm 
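Define $\zeta = \frac{\sup_{x \in [-B,B]^m} \|\nabla F(x)\|_{H^\dagger}^2}{F(-B, B, \ldots, B)}$. Equation (26) in Appendix D.1.1 shows that for any vectors
$w^j, w^k \in \mathcal{W}_B$,

$$D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq n \zeta \lambda_m(H)\, \|w^j - w^k\|_L^2.$$

Consider the pair of scalars $\alpha \in (0, \tfrac{1}{4})$ and $\delta > 0$ whose values will be specified later. Let $M(\alpha)$ be
as defined in (14). In Appendix B.2 we constructed a set $\{w^1, \ldots, w^{M(\alpha)}\}$ of vectors of length $d$ that
satisfy $\langle w^j, \mathbf{1} \rangle = 0$ for every $j \in [M(\alpha)]$, and for every pair of vectors in this set, $\|w^j - w^k\|_2^2 \geq \alpha\delta^2$
and $\frac{1}{\binom{M(\alpha)}{2}} \sum_{j \neq k} \|w^j - w^k\|_L^2 \leq \frac{2\delta^2}{d}\, \mathrm{tr}(L)$. Applying Lemma 10 gives

$$\frac{1}{\binom{M(\alpha)}{2}} \sum_{j \neq k} \|w^j - w^k\|_L^2 \leq \frac{2\delta^2}{d}\, m(m-1).$$

Setting $\delta^2 = 0.005 \frac{d^2}{n \zeta \lambda_m(H)\, m(m-1)}$ yields

$$\frac{1}{\binom{M(\alpha)}{2}} \sum_{j \neq k} D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq 0.01\, d.$$

In a manner similar to Lemma 14 in the pairwise comparison case, one can show that in the general
setting of this section, $\mathrm{tr}(L^\dagger) \geq \frac{(d-1)^2}{m(m-1)}$. Then, every element from the packing set also satisfies
$\|w^j\|_\infty \leq B$ when $\delta \leq B$, which holds true under our assumption of $n \geq c\, \frac{\mathrm{tr}(L^\dagger)}{\zeta \lambda_m(H) B^2}$
with $c = 0.01$. Each element of our packing set thus belongs to the class $\mathcal{W}_B$. Applying Lemma 6
yields the lower bound

$$\inf_{\widehat{w}} \sup_{w^* \in \mathcal{W}_B} \mathbb{E}\big[\|\widehat{w} - w^*\|_2^2\big] \geq \frac{\alpha\delta^2}{2}\Big(1 - \frac{0.01\, d + \log 2}{\log M(\alpha)}\Big) = \frac{0.0025\, \alpha\, d^2}{n \zeta \lambda_m(H)\, m(m-1)}\Big(1 - \frac{0.01\, d + \log 2}{\log M(\alpha)}\Big).$$

Setting $\alpha = 0.01$ proves the claim for $d > 9$.

For the case of $d \leq 9$, consider the set of the three $d$-length vectors $z^1 = [0 \cdots 0\ {-1}]$,
$z^2 = [0 \cdots 0\ 1]$ and $z^3 = [0 \cdots 0\ 0]$. Construct the packing set $\{w^1, w^2, w^3\}$ from these three
vectors $\{z^1, z^2, z^3\}$ as done above for the case of $d > 9$. From the calculations made for the general
case above, we have for all pairs $\min_{j \neq k} \|w^j - w^k\|_2^2 \geq \frac{\delta^2}{9}$ and $\max_{j,k} \|w^j - w^k\|_L^2 \leq 4\delta^2$, and as a
result $\max_{j,k} D_{\mathrm{KL}}(\mathbb{P}_{w^j} \,\|\, \mathbb{P}_{w^k}) \leq 4 n \zeta \lambda_m(H) \delta^2$. Choosing $\delta^2 = \frac{\log 2}{8 n \zeta \lambda_m(H)}$ and applying Lemma 6 proves
the claim.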


D.2 Some implied properties of the model 

In this section, we prove the two auxiliary lemmas stated at the start of this appendix.

D.2.1 Proof of Lemma 10 
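From the definition (10) of $L$, we have

$$L\mathbf{1} = \frac{1}{n} \sum_{i=1}^n E_i (mI - \mathbf{1}\mathbf{1}^T) E_i^T \mathbf{1} = \frac{1}{n} \sum_{i=1}^n E_i (mI - \mathbf{1}\mathbf{1}^T) \mathbf{1} = 0,$$

showing that $\mathbf{1} \in \mathrm{nullspace}(L)$.

Now consider any non-zero vector $v := [v_1, \ldots, v_d]^T \in \mathbb{R}^d$ such that $v \notin \mathrm{span}(\mathbf{1})$. Then there
must exist some $i, j \in [d]$ such that $v_i \neq v_j$. We know that there exists some path from item $i$ to $j$
in the comparison hyper-graph. Thus there must exist some hyper-edge in this path with two items,
say $i', j'$, such that $v_{i'} \neq v_{j'}$. Suppose that hyper-edge corresponds to sample $\ell \in [n]$. Let $v' := E_\ell^T v$.
Then $v' \notin \mathrm{span}(\mathbf{1})$. The Cauchy-Schwarz inequality $\langle v', v' \rangle \langle \mathbf{1}, \mathbf{1} \rangle > (\langle v', \mathbf{1} \rangle)^2$ thus implies

$$v^T E_\ell (mI - \mathbf{1}\mathbf{1}^T) E_\ell^T v > 0.$$

Furthermore, for any $v'' \in \mathbb{R}^m$, the Cauchy-Schwarz inequality $\langle v'', v'' \rangle \langle \mathbf{1}, \mathbf{1} \rangle \geq (\langle v'', \mathbf{1} \rangle)^2$ implies
that for any $i \in [n]$, we have $v^T E_i (mI - \mathbf{1}\mathbf{1}^T) E_i^T v \geq 0$. Overall we conclude that $v^T L v > 0$ for
every $v \notin \mathrm{span}(\mathbf{1})$, and hence $\mathrm{nullspace}(L) = \mathrm{span}(\mathbf{1})$ and $\lambda_2(L) > 0$.

Finally, we have

$$\mathrm{tr}(L) = \frac{1}{n} \sum_{i=1}^n \mathrm{tr}\big(E_i (mI - \mathbf{1}\mathbf{1}^T) E_i^T\big) = \frac{1}{n} \sum_{i=1}^n \big( m\, \mathrm{tr}(E_i E_i^T) - \mathrm{tr}(E_i \mathbf{1}\mathbf{1}^T E_i^T) \big). \tag{27}$$

By the definition of the matrices $\{E_i\}_{i \in [n]}$, $\mathrm{tr}(E_i E_i^T) = m$ and $\mathrm{tr}(E_i \mathbf{1}\mathbf{1}^T E_i^T) = m$. Substituting these
values in (27) gives the desired result $\mathrm{tr}(L) = m(m-1)$. ■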
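The trace and nullspace claims of Lemma 10 can be verified on a small example. The sketch below builds L = (1/n) sum_i E_i (m I - 1 1^T) E_i^T directly from the lists of items compared in each sample, using the fact that E_i simply embeds an m-by-m block into the rows and columns of the selected items. The function name and the toy design (d = 4 items, m = 3 items per comparison) are illustrative assumptions.

```python
def mwise_laplacian(d, samples):
    """Return L = (1/n) * sum_i E_i (m I - 1 1^T) E_i^T as a nested list.

    `samples[i]` lists the m distinct items (0-based) compared in sample i.
    """
    n = len(samples)
    L = [[0.0] * d for _ in range(d)]
    for items in samples:
        m = len(items)
        for a in items:
            for b in items:
                # Entry (a, b) of the embedded block m I - 1 1^T.
                L[a][b] += ((m if a == b else 0) - 1) / n
    return L


def trace(matrix):
    """Sum of the diagonal entries."""
    return sum(matrix[i][i] for i in range(len(matrix)))


if __name__ == "__main__":
    design = [(0, 1, 2), (1, 2, 3), (0, 2, 3), (0, 1, 3)]
    L = mwise_laplacian(4, design)
    print("trace:", trace(L))              # m (m - 1) = 6 for m = 3
    print("row sums:", [sum(row) for row in L])
```

With m = 2 the same routine reduces to the pairwise Laplacian, whose trace is 2.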
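D.2.2 Proof of Lemma 11

Let $h_1, \ldots, h_m$ denote the $m$ eigenvectors of $H$, with $h_1 = \frac{1}{\sqrt{m}} \mathbf{1}$. Then for any vector $v' \in \mathbb{R}^m$,

$$v'^T H v' = \sum_{i=2}^m \lambda_i(H) \langle v', h_i \rangle^2 \geq \lambda_2(H) \sum_{i=2}^m \langle v', h_i \rangle^2 = \lambda_2(H) \Big( \sum_{i=1}^m \langle v', h_i \rangle^2 - \frac{1}{m} \langle v', \mathbf{1} \rangle^2 \Big) = \lambda_2(H)\, v'^T \Big( I - \frac{1}{m} \mathbf{1}\mathbf{1}^T \Big) v',$$

where the final step employed the property $\sum_{i=1}^m h_i h_i^T = I$ of the eigenvectors $h_1, \ldots, h_m$ of $H$. A
similar argument gives

$$v'^T H v' = \sum_{i=2}^m \lambda_i(H) \langle v', h_i \rangle^2 \leq \lambda_{\max}(H) \sum_{i=2}^m \langle v', h_i \rangle^2 = \lambda_{\max}(H) \Big( \sum_{i=1}^m \langle v', h_i \rangle^2 - \frac{1}{m} \langle v', \mathbf{1} \rangle^2 \Big) = \lambda_{\max}(H)\, v'^T \Big( I - \frac{1}{m} \mathbf{1}\mathbf{1}^T \Big) v'.$$

Setting $v' = R_j^T v$ gives

$$\lambda_2(H)\, v^T R_j \Big( I - \frac{1}{m} \mathbf{1}\mathbf{1}^T \Big) R_j^T v \leq v^T R_j H R_j^T v \leq \lambda_{\max}(H)\, v^T R_j \Big( I - \frac{1}{m} \mathbf{1}\mathbf{1}^T \Big) R_j^T v.$$

Observe that the matrix $I - \frac{1}{m} \mathbf{1}\mathbf{1}^T$ is invariant to permutation of the coordinates, and hence
$R_j (I - \frac{1}{m} \mathbf{1}\mathbf{1}^T) R_j^T = I - \frac{1}{m} \mathbf{1}\mathbf{1}^T$. This gives

$$\frac{\lambda_2(H)}{m}\, v^T (mI - \mathbf{1}\mathbf{1}^T)\, v \leq v^T R_j H R_j^T v \leq \frac{\lambda_{\max}(H)}{m}\, v^T (mI - \mathbf{1}\mathbf{1}^T)\, v. \quad ■$$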


Appendix E. Some useful tail bounds 

In this appendix, we collect a few useful tail bounds for quadratic forms in Gaussian and sub-Gaussian 
random variables. 

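Lemma 12 (Tail bound for Gaussian quadratic form) For any positive semidefinite matrix $Q$
and standard Gaussian vector $g \sim N(0, I_d)$, we have

$$\mathbb{P}\Big[ g^T Q g \geq \big( \sqrt{\mathrm{tr}(Q)} + \sqrt{|||Q|||_{\mathrm{op}}\, \delta} \big)^2 \Big] \leq e^{-\delta/2}, \tag{28}$$

valid for all $\delta \geq 0$.

Proof Note that the function $g \mapsto \|\sqrt{Q}\, g\|_2$ is Lipschitz with constant $|||\sqrt{Q}|||_{\mathrm{op}}$. Consequently,
by concentration for Lipschitz functions of Gaussian vectors (Ledoux, 2001), the random variable
$Z = \|\sqrt{Q}\, g\|_2$ satisfies the upper bound

$$\mathbb{P}\big[ Z \geq \mathbb{E}[Z] + t \big] \leq \exp\Big( -\frac{t^2}{2\, |||\sqrt{Q}|||_{\mathrm{op}}^2} \Big).$$

By Jensen's inequality, we have $\mathbb{E}[Z] = \mathbb{E}[\|\sqrt{Q}\, g\|_2] \leq \sqrt{\mathbb{E}[g^T Q g]} = \sqrt{\mathrm{tr}(Q)}$. Setting $t = \sqrt{|||Q|||_{\mathrm{op}}\, \delta}$
completes the proof. ■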
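A quick Monte Carlo experiment illustrates the tail bound (28). The sketch below restricts attention to a diagonal Q, which loses no generality because the standard Gaussian distribution is rotation invariant; the specific diagonal, the value of delta and the function name are illustrative assumptions of ours.

```python
import math
import random


def tail_frequency(q_diag, delta: float, trials: int = 20000, seed: int = 0) -> float:
    """Empirical frequency of g^T Q g >= (sqrt(tr Q) + sqrt(||Q||_op * delta))^2
    for Q = diag(q_diag) and g a standard Gaussian vector."""
    rng = random.Random(seed)
    threshold = (math.sqrt(sum(q_diag)) + math.sqrt(max(q_diag) * delta)) ** 2
    hits = 0
    for _ in range(trials):
        value = sum(q * rng.gauss(0.0, 1.0) ** 2 for q in q_diag)
        hits += value >= threshold
    return hits / trials


if __name__ == "__main__":
    freq = tail_frequency([1.0, 0.5, 0.25], delta=2.0)
    print("empirical:", freq, "bound:", math.exp(-1.0))
```

The empirical frequency is typically far below exp(-delta/2), reflecting that the bound is a convenient but not tight consequence of Gaussian concentration.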
Lemma 13 ((Hanson and Wright, 1971; Rudelson and Vershynin, 2013)) Let $V \in \mathbb{R}^d$ be
a random vector with independent zero-mean components that are sub-Gaussian with parameter $K$,
and let $M \in \mathbb{R}^{d \times d}$ be an arbitrary matrix. Then there is a universal constant $c > 0$ such that

$$\mathbb{P}\big[ |V^T M V - \mathbb{E}[V^T M V]| > t \big] \leq 2 \exp\Big( -c \min\Big\{ \frac{t^2}{K^4\, |||M|||_{\mathrm{fro}}^2},\ \frac{t}{K^2\, |||M|||_{\mathrm{op}}} \Big\} \Big) \quad \text{for all } t > 0. \tag{29}$$

Appendix F. Properties of Laplacian matrices 

By construction, the Laplacian $L$ of the comparison graph is symmetric and positive-semidefinite.
By the singular value decomposition, we can write $L = U^T \Lambda U$ where $U \in \mathbb{R}^{d \times d}$ is an orthonormal
matrix, and $\Lambda$ is a diagonal matrix of nonnegative eigenvalues with $\Lambda_{jj} = \lambda_j(L)$ for every $j \in [d]$.
Given our assumption of $\lambda_1(L) \leq \cdots \leq \lambda_d(L)$, we also have $\Lambda_{11} \leq \cdots \leq \Lambda_{dd}$. Also recall that $L^\dagger$
denotes the Moore-Penrose pseudo-inverse of $L$. In terms of the notation introduced, the Moore-Penrose
pseudo-inverse is then given by $L^\dagger = U^T \Lambda^\dagger U$, where $\Lambda^\dagger$ is a diagonal matrix with entries

$$\Lambda^\dagger_{jj} = \begin{cases} 1/\Lambda_{jj} & \text{if } \Lambda_{jj} > 0, \\ 0 & \text{otherwise.} \end{cases}$$

The following pair of lemmas establish some useful properties about L. 

Lemma 14 The Laplacian matrix (4) satisfies the trace constraints

$$\mathrm{tr}(L) = 2, \quad \text{and} \quad \mathrm{tr}(L^\dagger) \geq \frac{(d-1)^2}{2}.$$

Proof From the definition (4) of the matrix $L$, we have $\mathrm{tr}(L) = \frac{1}{n} \sum_{i=1}^n x_i^T x_i = 2$. We also
know that $\lambda_1(L) = 0$, and hence $\sum_{j=2}^d \lambda_j(L) = 2$. Given the latter constraint, the sum $\sum_{j=2}^d \frac{1}{\lambda_j(L)}$
is minimized when $\lambda_2(L) = \cdots = \lambda_d(L)$. Some simple algebra now gives the claimed result. ■
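The equal-eigenvalue case in this proof is attained by the complete comparison graph. As a hedged illustration (the function name and the choice d = 5 are ours), the sketch below builds L = (1/n) sum_i x_i x_i^T when every pair is compared exactly once, checks that tr(L) = 2, and checks that L equals (2/(d(d-1))) (d I - 1 1^T). That closed form has all nonzero eigenvalues equal to 2/(d-1), so the sum of their reciprocals is (d-1)^2/2.

```python
from itertools import combinations


def pairwise_laplacian(d, pairs):
    """L = (1/n) * sum_i x_i x_i^T, where x_i has +1 and -1 at the two compared items."""
    n = len(pairs)
    L = [[0.0] * d for _ in range(d)]
    for a, b in pairs:
        L[a][a] += 1.0 / n
        L[b][b] += 1.0 / n
        L[a][b] -= 1.0 / n
        L[b][a] -= 1.0 / n
    return L


def complete_graph_laplacian(d):
    """Laplacian when each of the d(d-1)/2 pairs is compared exactly once."""
    return pairwise_laplacian(d, list(combinations(range(d), 2)))


if __name__ == "__main__":
    d = 5
    L = complete_graph_laplacian(d)
    print("trace:", sum(L[i][i] for i in range(d)))
    print("diagonal entry:", L[0][0], "off-diagonal entry:", L[0][1])
```

Any other connected comparison graph spreads the same total eigenvalue mass of 2 less evenly, which can only increase the trace of the pseudo-inverse.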


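Lemma 15 For the matrix $L$ defined in (4), and for an $(n \times d)$ matrix $X$ with $x_i^T$ as its $i$-th row,

$$\mathrm{tr}\Big(\frac{1}{n} X L^\dagger X^T\Big) = d - 1, \quad \Big|\Big|\Big|\frac{1}{n} X L^\dagger X^T\Big|\Big|\Big|_{\mathrm{fro}}^2 = d - 1, \quad \text{and} \quad \Big|\Big|\Big|\frac{1}{n} X L^\dagger X^T\Big|\Big|\Big|_{\mathrm{op}} = 1.$$

Proof Let $Q = \frac{1}{n} X L^\dagger X^T$. Since $L = \frac{1}{n} X^T X = U^T \Lambda U$, the diagonal entries of $\Lambda$ are the squared
singular values of $X/\sqrt{n}$. Consequently, there must exist an orthonormal matrix $V$ such that
$X/\sqrt{n} = V \sqrt{\Lambda}\, U$, and thus we can write $Q = V \sqrt{\Lambda}\, \Lambda^\dagger \sqrt{\Lambda}\, V^T$. By definition of the Moore-Penrose
pseudo-inverse, the matrix $\sqrt{\Lambda}\, \Lambda^\dagger \sqrt{\Lambda}$ is a diagonal matrix; since the comparison graph is connected,
its diagonal contains $(d-1)$ ones and a single zero. Noting that $V$ is an orthonormal matrix gives
the claimed result. ■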


For future reference, we state and prove a lemma showing that these two semi-norms satisfy a 
restricted form of the Cauchy-Schwarz inequality: 

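Lemma 16 For any two vectors $u$ and $v$ such that $u \perp \mathrm{nullspace}(L)$ or/and $v \perp \mathrm{nullspace}(L)$, we
have

$$|\langle u, v \rangle| \leq \|u\|_{L^\dagger} \|v\|_L. \tag{30}$$

Proof Since $L = U^T \Lambda U$ and $L^\dagger = U^T \Lambda^\dagger U$, we have

$$\sqrt{v^T L v}\, \sqrt{u^T L^\dagger u} = \sqrt{v^T U^T \Lambda U v}\, \sqrt{u^T U^T \Lambda^\dagger U u} = \|\tilde{v}\|_2 \|\tilde{u}\|_2 \geq |\langle \tilde{v}, \tilde{u} \rangle|,$$

where we have defined $\tilde{v} := \sqrt{\Lambda}\, U v$ and $\tilde{u} := \sqrt{\Lambda^\dagger}\, U u$. Continuing on,

$$\langle \tilde{v}, \tilde{u} \rangle = v^T U^T \sqrt{\Lambda} \sqrt{\Lambda^\dagger}\, U u = v^T U^T U u,$$

where we have used the fact that $u$ or/and $v$ are orthogonal to the null space of $L$. Since $U$ is
orthonormal, we conclude that $\langle \tilde{v}, \tilde{u} \rangle = \langle v, u \rangle$, which completes the proof. ■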


Appendix G. Minimax risk without assumptions on quality scores 

The setting considered throughout the paper imposes two restrictions (2) on the quality score vector
$w^*$. The first condition is that of shift invariance, that is, $\langle w^*, \mathbf{1} \rangle = 0$. The necessity of this condition
for identifiability under the Ordinal model is easy to verify. The second condition is that the quality
score vectors are $B$-bounded, that is, $\|w^*\|_\infty \leq B$ for some finite $B$. In this section, for the sake of
completeness, we show that the minimax risk is infinite in the absence of this condition.

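Proposition 17 Any estimator $\widehat{w}$ based on $n$ samples from the Ordinal model (with unbounded
quality score vectors) has error lower bounded as

$$\sup_{w^* \in \mathcal{W}_\infty} \mathbb{E}\big[\|\widehat{w} - w^*\|_2^2\big] = \sup_{w^* \in \mathcal{W}_\infty} \mathbb{E}\big[\|\widehat{w} - w^*\|_L^2\big] = \infty.$$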

The remainder of this section is devoted to the formal proof of Proposition 17. Consider the event
where for every comparison, the item with the higher quality score in $w^*$ wins. For any $w^* \in \mathcal{W}_\infty \setminus \{0\}$,
this event occurs with a probability at least $2^{-n}$, since $F$ is non-decreasing with $F(0) = \tfrac{1}{2}$. Under this event, the true $w^*$ is indistinguishable
from the quality score vector $c\,w^* \in \mathcal{W}_\infty$ for every $c > 0$, and the error is also unbounded. Since the
probability of this event is strictly bounded away from zero, the expected error is also unbounded.
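The mechanism behind this argument can be seen directly in the likelihood. The sketch below assumes the logistic F of the BTL model and a handful of hypothetical score gaps (both are our choices for illustration): when every comparison is won by the higher-scored item, the log-likelihood of the rescaled vector c w* keeps increasing with c, so the data cannot pin down the scale.

```python
import math


def F(x: float) -> float:
    """Logistic CDF (BTL model); an assumption made for this illustration."""
    return 1.0 / (1.0 + math.exp(-x))


def log_likelihood(scale: float, gaps, sigma: float = 1.0) -> float:
    """Log-likelihood of scale * w* on the event that every comparison is won by
    the higher-scored item; `gaps` holds the positive values <x_i, w*>."""
    return sum(math.log(F(scale * g / sigma)) for g in gaps)


def event_probability(gaps, sigma: float = 1.0) -> float:
    """Probability, under w*, that the higher-scored item wins every comparison."""
    prob = 1.0
    for g in gaps:
        prob *= F(g / sigma)
    return prob


if __name__ == "__main__":
    gaps = [0.5, 1.0, 0.2]
    for c in (1.0, 10.0, 100.0):
        print(c, log_likelihood(c, gaps))
    print("event probability:", event_probability(gaps), ">=", 0.5 ** len(gaps))
```

The same monotonicity holds for any non-decreasing F, which is why a boundedness constraint on the scores is needed for a finite minimax risk.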